Crosstalk cancellation and adaptive binaural filtering for listening system using remote signal sources and on-ear microphones

ABSTRACT

A listening system includes a first microphone device and a second microphone device that generate a first electronic signal and a second electronic signal corresponding to sound within audio detection range. Control logic of the first microphone device detects a crosstalk audio signal from a direction of the second microphone device that matches the second electronic signal. The first electronic signal includes a mixture that includes the crosstalk audio signal. An ear playback device is associated with the second microphone device. A processing device receives the first electronic signal and the second electronic signal, removes the second electronic signal from the first electronic signal to generate a cleansed first electronic signal, and processes the cleansed first electronic signal to integrate the cleansed first electronic signal into an output signal to the ear playback device.

RELATED APPLICATIONS

The present application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 63/324,983, filed Mar. 29, 2022, which is incorporated herein by reference.

TECHNICAL FIELD

Embodiments of the disclosure relate generally to listening systems, and more specifically, relate to crosstalk cancellation and adaptive binaural filtering for a listening system using remote signal sources and on-ear microphones.

BACKGROUND

Listening devices such as hearing aids and cochlear implants often perform poorly in noisy environments. Remote microphones, which transmit sound directly from a distant talker to the ears of a listener, have been shown to improve intelligibility in adverse environments. The signal from a remote microphone has less noise and reverberation than the signals captured by the earpieces of a listening device, effectively bringing the talker closer.

Although remote microphones can dramatically improve intelligibility, remote microphones often sound artificial. In commercial devices, the signal from the remote microphone is generally presented diotically, i.e., without accounting for delay between the ears. This signal matches the spectral coloration of the remote microphones rather than that of microphones in the earpieces, and lacks the interaural time and level differences that humans use to localize sounds. Some modern efforts to resolve these issues are either too processing intensive to be practical and/or employ external microphones that are not sufficiently close to talkers of interest, necessitating beamforming to achieve strong noise reduction. Such systems can be difficult or expensive to implement and are sensitive to motion.

Further, when remote microphones are employed near talkers, at least some of whom are using such listening devices, crosstalk is possible in group conversations. For example, participants in the conversation may hear a delayed copy of their own speech that was picked up by the microphone of a nearby (or sufficiently close) participant. While muting the inactive microphone is an option, it can often be distracting and cause participants to miss parts of the conversation, e.g., the first syllables spoken by users who had previously been muted. In a fast-moving conversation, this could be especially annoying.

BRIEF DESCRIPTION OF THE DRAWINGS

A more particular description of the disclosure briefly described above will be rendered by reference to the appended drawings. Understanding that these drawings only provide information concerning typical embodiments and are not therefore to be considered limiting of its scope, the disclosure will be described and explained with additional specificity and detail through the use of the accompanying drawings.

FIG. 1 is a block diagram of a listening system that integrates filtering electronic signals from remote signal sources with locally-captured audio signals, according to an embodiment.

FIG. 2 is a simplified flow chart of a method for filtering the electronic signals with the locally-captured audio signals, according to an embodiment.

FIG. 3A is a block diagram of an example set of single-input, binaural-output (SIBO) audio filters to perform adaptive filtration on multiple electronic signals from remote signal sources, according to an embodiment.

FIG. 3B is a block diagram of an example set of multi-input, binaural-output (MIBO) audio filters to perform adaptive filtration on multiple electronic signals from remote signal sources, according to an embodiment.

FIG. 4A is a block diagram of an experimental setup with a moving human talker with multiple signal sources and a non-moving listener, according to an embodiment.

FIG. 4B is a block diagram of the experimental setup with three loudspeaker signal sources and a moving listener, according to an embodiment.

FIG. 5 is a set of graphs illustrating filter performance for a single moving talker, according to an embodiment.

FIGS. 6A-6D are a set of graphs illustrating apparent interaural time delays (ITDs) from either near signal sources or far signal sources, varied between the filters of FIG. 4A and FIG. 4B, according to various embodiments.

FIG. 7 is a block diagram illustrating an exemplary listening system involving remote microphones that are co-located in an area and associated with a group conversation, according to various embodiments.

FIG. 8 is a simplified block diagram of an example of crosstalk cancellation as between two remote microphones associated with two users illustrated in FIG. 7, according to at least one embodiment.

FIG. 9 is a flow chart of a method of crosstalk cancellation as between two remote microphones associated with the first and second users of FIG. 7, according to at least one embodiment.

FIG. 10 is a simplified block diagram of an example of crosstalk cancellation as between three remote microphones associated with three users illustrated in FIG. 7, according to at least one embodiment.

FIG. 11 is a graph illustrating noise reduction performance at a left earpiece of a listener (where higher values are better), according to experimental embodiments.

FIG. 12A is a graph illustrating own-speech crosstalk suppression performance at a left earpiece of a listener using a head microphone adapted to perform voice activity detection (VAD), according to experimental embodiments.

FIG. 12B is a graph illustrating own-speech crosstalk suppression performance at a left earpiece of a listener using a lapel microphone adapted to perform VAD, according to experimental embodiments.

FIG. 13A is a graph illustrating high-frequency interaural level differences of other talkers at the ears of a listener, where subjects take turns speaking while moving to face each other, according to experimental embodiments.

FIG. 13B is a graph illustrating high-frequency interaural level differences of other talkers at the ears of a listener, simulated with double-talk and triple-talk with subjects facing forward, according to experimental embodiments.

FIG. 14 is a block diagram of an example computer system in which embodiments of the present disclosure can operate.

DETAILED DESCRIPTION

By way of introduction, the present disclosure relates to crosstalk cancellation and adaptive binaural filtering for a listening system that uses remote signal sources and on-ear microphones to enhance received sound realism in all kinds of sound environments. Many listening systems or devices can stream sound from external sources such as smartphones, televisions, or wireless microphones placed on a talker. These streamed signals have lower noise than the sound picked up by microphones integrated within earpieces and so are helpful in noisy environments, but they lack spatial cues that help users tell where sounds are coming from, and they usually only work with one sound source (e.g., talker) at a time.

Aspects of the present disclosure address the above and other deficiencies by combining one or more remote signal sources that generate one or more electronic signals, which correspond to sound sources in an ambient environment, with a combination of audio signals detected locally by an ear microphone. The combination of audio signals can include, for example, ambient sound and one or more propagated audio signals, which correspond to the same sound sources as the one or more electronic signals but which have propagated through the air acoustically to the ears of the listener. A processing device that is coupled to the one or more remote audio sources and to the ear microphone can then apply a set of audio filters to this combination of electronic signals and audio signals. For example, a respective audio filter can process a respective electronic signal of the one or more electronic signals with an error signal, which is based on an output of the ear microphone, to generate an output signal to an ear playback device. In these embodiments, acoustic cue components of the output signal match corresponding acoustic cue components of the combination of audio signals. In this way, the disclosed listening system or device helps the human brain to separate out sounds from different sources, such as two people talking at the same time, and to do so binaurally. Thus, the present disclosure makes it easier for users to hear in group conversations and to do so realistically in a noisy ambient environment.

In various embodiments, the disclosed listening system and devices work by processing the clean signals from the remote signal sources to match the sound captured by ear (or earpiece) microphones of the listening devices. Because the ear microphones are next to the ears, they provide useful acoustic cues. The processing device applies an adaptive filter, which can be employed as a set of audio filters, similar to the kind used for echo cancellation, but that enhances the sound instead of canceling it. The audio filters are updated as the talkers and listener move. Thus, the current listening system and devices are especially adapted to help listeners hear remote signal sources more accurately and with a more immersive experience, despite being within a noisy environment.

FIG. 1 is a block diagram of a listening system 100 (or listening device) that integrates filtering electronic signals from remote signal sources with locally-captured audio signals, according to an embodiment. According to some embodiments, the listening system 100 includes one or more remote signal sources 102, e.g., a remote signal source 102A, a remote signal source 102B, a remote signal source 102C, and a remote signal source 102N. The one or more remote signal sources 102, for example, can include one or more microphones (e.g., a remote microphone placed on or near each of multiple talkers), a microphone that is part of a wearable listening device such as headphones, earbuds, a hearing aid, or the like, an array of microphones (e.g., placed near, or focusing on, a group of talkers), one or more audio signal transmitters, one or more broadcast devices, one or more sound systems, or a combination thereof.

A microphone placed on or near each talker, for example, would provide a reliable electronic signal from its wearer (e.g., each participant in a panel discussion), while an array of microphones can be used to enhance all nearby sounds, or to focus on specific sounds of interest, making the array well-suited for dynamic environments where talkers may freely join or leave the conversation within the listening range of the array (e.g., a small group of people discussing a poster presentation in a large, noisy convention center). In some embodiments, the array of microphones is a ceiling-mounted array of beamforming microphones designed to pick up individual talkers that are moving around in certain zones of interest.

In some embodiments, the signal sources 102 include sound-system-generated electronic signals while speakers of the sound system produce corresponding audio signals that arrive at a listener as propagated audio signals. Certain venues such as theaters and churches may employ telecoil induction loops and radio-frequency or infrared broadcasts so that the transmitted signal appears to originate from the sound system of the venue.

In these embodiments, the listening system 100 further includes a pair of listening devices, such as a first ear listening device 110 (also referred to herein as associated with the right ear (or R) for ease of explanation) and a second ear listening device 120 (also referred to herein as the left ear (or L) for ease of explanation). The first ear listening device 110 can further include a first ear microphone 112 and a first ear playback device 116. The first ear microphone 112 can detect a first combination of audio signals including ambient sound (including noise) and one or more propagated audio signals, corresponding to the one or more electronic signals, received at a first ear of a listener. The second ear listening device 120 can further include a second ear microphone 122 and a second ear playback device 126. The second ear microphone 122 can detect a second combination of audio signals including ambient sound and one or more propagated audio signals, corresponding to the one or more electronic signals, received at a second ear of the listener that is different than the first ear.

In various embodiments, the first and second ear listening devices 110 and 120 are hearing aids, cochlear implants, earbuds, headphones, bone-conduction devices, or other types of in-ear, behind-the-ear, or over-the-ear listening devices. Thus, the first and second ear playback devices 116 and 126 can each be a receiver (e.g., in a hearing aid or a cochlear implant), a loudspeaker (e.g., of a headphone, headset, earbuds, or the like), or another sound playback device generally delivering sound to the first and/or second ears, either acoustically, via bone conduction, or by another manner of mechanical transduction.

In some alternative embodiments, the reference signals from the first and second ear microphones 112 and 122 can be derived from "virtual microphones" inferred from other physical signals, for example using a linear prediction filter or other means of linear estimation. For example, a multiple-input, binaural-output linear prediction filter could predict the signal at the ears based on signals captured by a microphone array surrounding the head. Such a prediction filter could be derived from prior measurements using on-ear microphones and then applied in the field. This use of virtual microphones is not restricted to "prediction" and may involve a variety of methods of estimating the sound that would appear at a microphone at the ear through measurements from other physical signals.

In various embodiments, the listening system 100 further includes a mobile device 140, which can be any type of mobile processing device such as a mini-computer, a programmed processing device, a smart phone, a mini-tablet, or the like. The mobile device 140 can include a processing device 150, one or more audio detectors 155, a user interface 160, which can be integrated within a graphical user interface displayable on a screen, for example, and a communication interface 170. In some embodiments, the processing device 150 is at least partially located within either of the first ear listening device 110 or the second ear listening device 120, or both. In at least some embodiments, the processing device 150 is coupled to the one or more remote signal sources 102, to the first ear microphone 112, to the first ear playback device 116, to the second ear microphone 122, and to the second ear playback device 126.

In some embodiments, one or more of the audio detectors 155 are located within either of the first ear listening device 110 or the second ear listening device 120, or both, where optional locations are illustrated in dashed lines. In some embodiments, the communication interface 170 is adapted to communicate over networks such as a personal area network (PAN), a body area network (BAN), or a local area network (LAN) using technology protocols such as, for example, Bluetooth®, Zigbee®, or a similar protocol that may be developed in the future and that is sufficiently low-latency for electronic audio signal transmission.

In at least some embodiments, the listening system 100 includes a first hearing device containing the first ear microphone 112 and connected to the first ear playback device 116 and a second hearing device containing the second ear microphone 122 and connected to the second ear playback device 126. The processing device 150 can be located within one of the first hearing device, the second hearing device, or the mobile device 140 communicatively coupled to the first hearing device and the second hearing device.

FIG. 2 is a simplified flow chart of a method 200 for filtering the electronic signals with the locally-captured audio signals, according to an embodiment. The method 200 can be performed by processing logic that can include hardware (e.g., processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of a device, integrated circuit, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, the method 200 is performed by the processing device 150 of FIG. 1, e.g., in conjunction with other hardware components of the listening system 100 (or device). Although shown in a particular sequence or order, unless otherwise specified, the order of the processes can be modified. Thus, the illustrated embodiments should be understood only as examples, and the illustrated processes can be performed in a different order, and some processes can be performed in parallel. Additionally, one or more processes can be omitted in various embodiments. Thus, not all processes are required in every embodiment. Other process flows are possible.

At operation 210, the processing logic detects a combination of audio signals including ambient sound and one or more propagated audio signals, corresponding to the one or more electronic signals, received at an ear of a listener from the one or more remote signal sources 102.

At operation 220, the processing logic applies a set of audio filters to respective ones of the one or more electronic signals.

At operation 230, the processing logic applies an audio filter (of the set of audio filters) to a respective electronic signal of the one or more electronic signals with an error signal, which is based on (e.g., a function of) an output of the ear microphone, to generate a first output signal to the ear playback device.

At operation 240, the set of audio filters causes acoustic cue components of the output signal to match corresponding acoustic cue components of the combination of audio signals. Spatial cues are especially helpful with multiple conversation partners, as they help listeners to distinguish signals from different talkers.

With additional reference to FIG. 1, when at least the same number of remote signal sources 102 (such as microphones) as talkers are employed within the listening system 100, the spatial cues of the multiple talkers are best preserved. One advantage of integrating filtering of electronic signals from the remote signal sources 102 and the propagated audio signals detected by the first ear microphone 112 and the second ear microphone 122 is that ambient noise is weakly correlated between the remote signal sources 102 and the local ear microphones. This property has been used to identify the acoustic channel between talkers of interest and the microphones of an array of remote microphones. Here, this correlation property can be exploited to match the magnitude and/or phase of the electronic signals to the propagated audio signals received by the first and second ear microphones 112 and 122.

In various embodiments, as will be explained in more detail with reference to FIGS. 3A-3B, adaptive filters use the electronic signals as inputs and the combined (received) audio signals as references for the desired outputs. If the noise is uncorrelated between the input electronic signals and the reference signals, then the filter matches the cues of the signals of interest. This adaptive approach need not explicitly estimate the acoustic channel or attempt to separate the sources. The present disclosure proposes two variants of adaptive filtering. A first variant can be a set of independently-adapted single-input, binaural-output (SIBO) filters for wearable microphones on spatially separated moving talkers, which will be discussed in more detail with reference to FIG. 3A. The second variant can be a jointly-adapted multiple-input, binaural-output (MIBO) filter suitable for arrays and closely-grouped talkers, which will be discussed in more detail with reference to FIG. 3B.

For purposes of explanation, assume there are M of the remote signal sources 102 (which for the present examples are assumed to be remote microphones) placed near N talkers of interest. The reader can assume that the present example can be expanded to different remote signal sources 102 that generate electronic signals and audio signals, e.g., the speakers of a sound system or the other venue examples referred to herein. For purposes of the mathematical formulation, assume that the electronic signals from the remote signal sources 102 are available instantaneously and synchronously to the first and second ear listening devices 110 and 120, for example.

Let $s[t]=[s_1[t],\ldots,s_N[t]]^T$ be the sampled speech signals produced by the talkers of interest. Consider a short time interval during which the talkers, listener, and microphones do not move, or whose movement is sufficiently small that its effects can be ignored. The discrete-time signals $x_e[t]\in\mathbb{R}^2$ received by the first and second ear microphones 112 and 122 and $x_r[t]=[x_{r,1}[t],\ldots,x_{r,M}[t]]^T\in\mathbb{R}^M$ received (or generated) by the remote signal sources 102 are given by

$$x_e[t]=\sum_{n=1}^{N}(a_{e,n}*s_n)[t]+z_e[t], \qquad (1)$$

$$x_r[t]=\sum_{n=1}^{N}(a_{r,n}*s_n)[t]+z_r[t], \qquad (2)$$

where $*$ denotes linear convolution, and $a_{e,n}[t]\in\mathbb{R}^2$ and $a_{r,n}[t]\in\mathbb{R}^M$ are the equivalent discrete-time acoustic impulse responses i) between source $n$ and the first and/or second ear microphones 112 and 122 and ii) between source $n$ and the remote signal sources 102, respectively, for $n=1,\ldots,N$. Further, $z_e[t]\in\mathbb{R}^2$ and $z_r[t]\in\mathbb{R}^M$ are additive noise at the first and/or second ear microphones 112 and 122 and at the remote signal sources 102, respectively. While the adaptive filters referred to herein are generally described mathematically in the time domain for ease of explanation, these adaptive filters can also be implemented in the time-frequency domain using, e.g., the short-time Fourier transform, a filter bank, or another appropriate filter structure for adaptive acoustic filters.

In various embodiments, the listening system 100 produces a binaural output $y[t]\in\mathbb{R}^2$ given by

$$y[t]=\sum_{m=1}^{M}(w_m * x_{r,m})[t], \qquad (3)$$

where $w_m[t]\in\mathbb{R}^2$ is a discrete-time binaural filter for inputs $m=1,\ldots,M$. Unlike in a binaural beamformer, the propagated audio signals received at the ears are not inputs to the filter used to generate the output signal to the right and left ear playback devices 116 and 126, e.g., $y[t]$. However, the propagated audio signals could be mixed with $y[t]$ if desired to improve spatial awareness of ambient noise, as will be further discussed.

In at least some embodiments, the listening system 100 is designed to be perceptually transparent so that the binaural output approximates the signal captured by the first and/or second ear microphones 112 and 122 but with less noise. Mathematically, the desired output $d[t]\in\mathbb{R}^2$ can be given by

$$d[t]=\sum_{n=1}^{N}(g_n * a_{e,n} * s_n)[t], \qquad (4)$$

where $g_n[t]\in\mathbb{R}$ is the desired processing to be applied to each source $n$. The $g_n$'s can be used to apply different amplification and spectral shaping to each source, for example based on distance. The binaural impulse responses $a_{e,n}$ encode the effects of room acoustics on the spectrum of each speech signal as well as the interaural time and level differences used to localize sounds.

It may be convenient to analyze the filters in the frequency domain. Let $W(\omega)\in\mathbb{C}^{2\times M}$, $A_e(\omega)\in\mathbb{C}^{2\times N}$, $A_r(\omega)\in\mathbb{C}^{M\times N}$, and $G(\omega)\in\mathbb{C}^{N\times N}$ be the discrete-time Fourier transforms of their respective impulse responses, where $G$ is a diagonal matrix of desired responses for the $N$ sources. To preserve the spectral and spatial cues of the $N$ distinct sources, the filter should satisfy

$$W(\omega)A_r(\omega)=A_e(\omega)G(\omega). \qquad (5)$$

For arbitrary $A_r$, the filter can meet this condition if $M\ge N$, that is, if there are at least as many remote signal sources as talkers.

Adaptive filters are often designed to minimize a mean square error (MSE) between the output and desired signals. If the speech sources and noise were wide-sense stationary random processes with known second-order statistics and if the acoustic impulse responses were known, one could directly minimize the MSE loss

$$\mathrm{MSE}[t]=\mathbb{E}\left[\left|y[t]-d[t]\right|^2\right], \qquad (6)$$

where $\mathbb{E}$ denotes statistical expectation.

In various embodiments, if the filters are allowed to be non-causal and to have infinite length, then the linear minimum-mean-square-error (MMSE) filter can be readily computed in the frequency domain. Assume that all signals have zero mean and that the speech signals are uncorrelated with the noise signals. Let $R_s(\omega)\in\mathbb{C}^{N\times N}$, $R_{z_e}(\omega)\in\mathbb{C}^{2\times 2}$, and $R_{z_r}(\omega)\in\mathbb{C}^{M\times M}$ be the power spectral density matrices for $s[t]$, $z_e[t]$, and $z_r[t]$, respectively, and let $R_{z_e z_r}(\omega)\in\mathbb{C}^{2\times M}$ be the cross-power spectral density between $z_e[t]$ and $z_r[t]$. Then the MMSE filter is given by

$$W_{\mathrm{MMSE}}(\omega)=A_e(\omega)G(\omega)R_s(\omega)A_r^H(\omega)\left[A_r(\omega)R_s(\omega)A_r^H(\omega)+R_{z_r}(\omega)\right]^{-1}. \qquad (7)$$

If $A_r$ has full column rank, then the Woodbury identity can be used to show that the MMSE filter satisfies Equation (5) in the high-signal-to-noise-ratio (SNR) limit. In the remainder of the disclosure, the frequency variable $\omega$ is omitted for brevity.
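As a concrete illustration, the sketch below evaluates Equation (7) at a single frequency bin using NumPy. The function name mmse_filter, the example dimensions, and the random transfer functions are illustrative assumptions that stand in for measured acoustic responses; in the high-SNR limit the product W·A_r should approach A_e·G, as predicted by Equation (5).

```python
import numpy as np

def mmse_filter(A_e, A_r, G, R_s, R_zr):
    """Evaluate Equation (7) at a single frequency bin.

    A_e : (2, N) ear-microphone transfer functions
    A_r : (M, N) remote-source transfer functions
    G   : (N, N) diagonal matrix of desired responses
    R_s : (N, N) speech power spectral density matrix
    R_zr: (M, M) remote-noise power spectral density matrix
    Returns the (2, M) MMSE filter W_MMSE at this bin.
    """
    AH = A_r.conj().T                       # A_r^H
    num = A_e @ G @ R_s @ AH                # A_e G R_s A_r^H
    den = A_r @ R_s @ AH + R_zr             # A_r R_s A_r^H + R_z_r
    return num @ np.linalg.inv(den)

# Example with M = N = 2 at one frequency bin (random placeholders).
rng = np.random.default_rng(0)
A_e = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
A_r = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
G = np.eye(2)                               # transparent desired response
R_s = np.diag([1.0, 0.5])                   # uncorrelated speech sources
R_zr = 1e-3 * np.eye(2)                     # weak remote-microphone noise
W = mmse_filter(A_e, A_r, G, R_s, R_zr)
# Deviation from Equation (5); small when the remote noise is weak.
print(np.max(np.abs(W @ A_r - A_e @ G)))
```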

The MMSE filter relies on the signal statistics and the transfer functions between the remote signal sources 102 and the first and second ear microphones 112 and 122, which can be difficult to estimate. Fortunately, when remote microphones are close to the sources, they provide high-quality reference signals that eliminate the need for complex source separation algorithms. Ambient noise signals are often mostly uncorrelated between on-ear and remote microphones. The processing device 150 may use this property to efficiently estimate the relative transfer function between the remote signal sources 102 and the first and second ear microphones 112 and 122 using the noisy mixture. This same principle can be applied to the adaptive filtering problem, replacing the desired signal d[t] with the noisy propagated audio signal received at the ear microphone(s), as will be discussed in more detail with reference to FIGS. 3A-3B.

In various embodiments, the above adaptive filter formulation automatically time-aligns the signals, e.g., adds delays to the remote electronic signals, which travel faster than sound. Further, the adaptive filter formulation matches the magnitude and/or phase between the remote electronic signals and the propagated audio signals that arrive at the speed of sound through the air at the ears of the listener. These features help to prevent echoes and distortion.

Further, in at least some embodiments, a listener is given access to tuning options via the user interface 160. For example, the user interface 160 can display a number of menu selection items, from which the listener can choose to, e.g., listen only to the remote signal sources (e.g., distant talkers), or choose to hear everything in the environment for situational awareness.

FIG. 3A is a block diagram of an example set of single-input, binaural-output (SIBO) audio filters 350A to perform adaptive filtration on multiple electronic signals from the remote signal sources 102, according to an embodiment. The SIBO audio filters 350A are labeled W_(m,R) for a first set of audio filters for the right ear and W_(m,L) for a second set of audio filters for the left ear. Each SIBO audio filter can process a separate incoming remote electronic signal (coming from the one or more remote signal sources 102) together with an error signal that is fed back as a reference from an output (x_(e,R) or x_(e,L)) of the first or second ear microphone 112 or 122, respectively. The adaptive filtering of FIG. 3A can be best suited for restricting the listening system 300 to only certain talkers, for applying different amplification to different talkers, or for situations where remote microphones move so that coherent combining is difficult. In these situations, where individual talkers are tracked, a remote microphone can be placed on or near each individual talker as the remote signal sources 102.

More specifically, in some embodiments, the processing device 150 applies a first set of audio filters (W_(m,R)) including an audio filter (e.g., W_(1,R)) to process a respective electronic signal of the one or more electronic signals with a first error signal 301A, which is based on an output (x_(e,R)) of the first ear microphone 112, to generate a first output signal 340A to the first ear playback device 116. In some embodiments, acoustic cue components of the first output signal 340A match corresponding acoustic cue components of the first combination of audio signals received by the first ear microphone 112. In at least some embodiments, the processing device 150 further applies a second set of audio filters (W_(m,L)) including an audio filter (e.g., W_(1,L)) to process a respective electronic signal of the one or more electronic signals with a second error signal 301B, which is based on an output (x_(e,L)) of the second ear microphone 122, to generate a second output signal 340B to the second ear playback device 126. In some embodiments, acoustic cue components of the second output signal 340B match corresponding acoustic cue components of the second combination of audio signals received by the second ear microphone 122.
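The following minimal sketch, in Python with NumPy, illustrates the SIBO structure described above: each remote electronic signal is convolved with its own pair of per-ear filters and the intermediate outputs are summed per ear, mirroring the summers 310A and 310B. The function name and array shapes are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def apply_sibo_filters(x_remote, w_left, w_right):
    """Filter each remote signal with its own binaural (SIBO) filter and
    sum the intermediate outputs per ear.

    x_remote : (M, T) remote electronic signals
    w_left   : (M, L) per-source filters for the left-ear output
    w_right  : (M, L) per-source filters for the right-ear output
    Returns (y_left, y_right), each of length T + L - 1.
    """
    M, T = x_remote.shape
    L = w_left.shape[1]
    y_left = np.zeros(T + L - 1)
    y_right = np.zeros(T + L - 1)
    for m in range(M):
        # Intermediate outputs 303: one binaural signal per remote source.
        y_left += np.convolve(x_remote[m], w_left[m])
        y_right += np.convolve(x_remote[m], w_right[m])
    return y_left, y_right
```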

Each filter $w_m$ is designed to reproduce the speech of talker $m$, so that $(w_m * x_{r,m})[t]\approx(g_m * a_{e,m} * s_m)[t]$ for $m=1,\ldots,M$ with $M=N$, where the symbol $\approx$ denotes "approximately equal." Each filter is computed separately to minimize its own loss function

$$\mathcal{L}_m[t]=\mathbb{E}\left[\left|(w_m * x_{r,m})[t]-(g_m * x_e)[t]\right|^2\right]. \qquad (8)$$

The solution is given in the frequency domain by the $2\times 1$ filter

$$W_m=G_m\left[A_e R_s A_{r,m}^H + R_{z_e z_{r,m}}\right]\left[A_{r,m} R_s A_{r,m}^H + R_{z_{r,m}}\right]^{-1}, \qquad (9)$$

where $A_{r,m}$ is the row of $A_r$ corresponding to microphone $m$. If the speech sources are uncorrelated, then the SIBO filter can be expressed as

$$W_m=G_m\,\frac{A_{e,1}R_{s_1}A_{r,m,1}^{*}+\cdots+A_{e,M}R_{s_M}A_{r,m,M}^{*}+R_{z_e z_{r,m}}}{|A_{r,m,1}|^{2}R_{s_1}+\cdots+|A_{r,m,M}|^{2}R_{s_M}+R_{z_{r,m}}}. \qquad (10)$$

It can be seen from Equation (10) that the interaural cues are distorted by crosstalk among the remote microphones as well as by correlated noise. Crosstalk can also produce unintended interference effects, such as comb-filtering distortion, when the SIBO filter outputs are summed.

In some embodiments, intermediate outputs 303 of the audio filters are optionally further processed, e.g., by other_processing_1R, other_processing_2R, and other_processing_3R for the three illustrated audio filters of one set of audio filters, and by other_processing_1L, other_processing_2L, and other_processing_3L for the three audio filters of the other set of audio filters. In these embodiments, each of these "other_processing" blocks includes the same or different signal processing, such as frequency-selective gain, feedback suppression, noise reduction, and dynamic range compression. For example, dynamic range compression could operate independently on each of the intermediate outputs 303, which may prevent certain types of distortion.

In at least some embodiments, the one or more electronic signals include multiple electronic signals, each audio filter of the first set of audio filters is to generate an intermediate output signal corresponding to a respective electronic signal, and the processing device 150 further combines (e.g., at a first summer 310A) the intermediate output signals to generate the first output signal 340A. In at least some embodiments, each audio filter of the second set of audio filters is to generate an intermediate output signal corresponding to a respective electronic signal, and the processing device 150 further combines (e.g., at a second summer 310B) the intermediate output signals to generate the second output signal 340B.

In additional embodiments, the processing device imparts additional processing before the output signals 340A and 340B are generated. More specifically, in some embodiments, the processing device 150 processes (e.g., with other_processing_4R) the output (x_(e,R)) of the first ear microphone 112 to generate a first processed microphone signal and mixes, into the first output signal 340A, the first processed microphone signal. Further, in these embodiments, the processing device 150 processes (e.g., with other_processing_4L) the output (x_(e,L)) of the second ear microphone 122 to generate a second processed microphone signal and mixes, into the second output signal 340B, the second processed microphone signal. The mixing can occur at the first and second summers 310A and 310B, respectively. Further, the processing device 150 can control the relative levels of the live audio signals from the first and/or second ear microphones 112 and 122 compared to the processed audio signals, e.g., for a selected trade-off between signal-to-noise ratio improvement and environmental awareness. This type of mixing can also reduce distortion of binaural cues for non-target sound sources.
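A minimal sketch of this mixing step is shown below, assuming a single scalar gain controls the level of the processed ear-microphone signal relative to the filtered output. The name ambient_gain and its default value are hypothetical.

```python
import numpy as np

def mix_with_ambient(y_filtered, x_ear, ambient_gain=0.2):
    """Mix the processed ear-microphone signal into the filtered output.

    y_filtered   : output of the summed adaptive filters for one ear
    x_ear        : (processed) signal from that ear's microphone
    ambient_gain : relative level of the live mixture; larger values favor
                   environmental awareness, smaller values favor SNR.
    """
    T = min(len(y_filtered), len(x_ear))
    return y_filtered[:T] + ambient_gain * x_ear[:T]
```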

In some embodiments, the processing device 150 further processes the first combination of the intermediate output signals with post-processing (e.g., other_processing_5R) that corresponds to an audio parameter before generating the first output signal 340A. In some embodiments, the processing device 150 further processes the second combination of the intermediate output signals with post-processing (e.g., other_processing_5L) that corresponds to the audio parameter before generating the second output signal 340B. In some embodiments, the processing device receives, via the user interface 160, a menu selection to adjust the audio parameter, and adjusts the audio parameter of the post-processing according to the menu selection. The audio parameter can include, for example, different volume levels at respective ear playback devices 116 and 126, different volume levels imparted to the electronic signals from the one or more signal sources 102 compared to the volume level of the propagated audio signals received at the first and second ear microphones 112 and 122, and the like.

In at least some embodiments, instead of enhancing signals of interest, the listening system 100 could be used to remove unwanted sound source signals. The output of one or more of the set of audio filters can be subtracted from the live sound captured by the first and/or second ear microphones 112 and 122. Such a system could, for example, reduce the level of music audio (or other public-address sounds) in a public venue using a copy of the audio signal of the music transmitted by the sound system. The unwanted sound signal could also be deliberately introduced in order to protect the privacy of a conversation between users of the system, either as music, white noise, or other background noise.

In at least some embodiments, the first set of audio filters (W_(m,R)) and the second set of audio filters (W_(m,L)) are defined by a parametric model that is to separately, for each of the remote signal sources 102A . . . 102N, at least one of apply an equalization filter that is shared by both the first output signal and the second output signal or encode interaural time and amplitude level differences between the first output signal and the second output signal used to localize sounds.

In at least some embodiments, the first set of audio filters (W_(m,R)) is defined by a parametric model that is to separately, for each of the remote signal sources 102A . . . 102N, at least one of: encode a delay time for the first output signal; perform parametric equalization for the first output signal; encode a set of effects of ambient acoustics on a spectrum of the one or more electronic signals; define a filter that has an impulse response of a particular length; or define a filter described by a set of poles and zeros. The parametric model can also be applied to the second set of audio filters (W_(m,L)).

In at least some embodiments, by way of example, the first error signal 301A includes a difference between the output of the first audio filter (W_(1,R)) and the output (x_(e,R)) of the first ear microphone 112. In these embodiments, the processing device 150 is further to input a first electronic audio signal, from a first remote signal source (e.g., x_(r,1)), to the first audio filter. The processing device 150 may cause a first relative transfer function of the first audio filter to adaptively minimize the first error signal in a first intermediate output signal, where it is understood that the mean-square value of the error or a variety of other functions of the error are to be minimized. The processing device 150 can input a second electronic signal, from a second remote signal source (e.g., x_(r,2)), to a second audio filter (W_(2,R)) of the first set of audio filters. The second error signal includes a difference between the output of the second audio filter (W_(2,R)) and the output (x_(e,R)) of the first ear microphone 112. The processing device 150 may then cause a second relative transfer function of the second audio filter (W_(2,R)) to adaptively minimize the second error signal in a second intermediate output signal, and combine, to generate the first output signal 340A, the first intermediate output signal with the second intermediate output signal, where it is understood that the mean-square value of the error or another function of the error can be minimized. These operations can be extended to additional ones of the remote signal sources 102.

In some embodiments, the processing device 150 further applies a first processing variable to the first intermediate output signal (e.g., other_processing_1R) and applies a second processing variable (e.g., other_processing_2R) to the second intermediate output signal. With additional reference to FIG. 1, in some embodiments, a first audio detector (of the one or more audio detectors 155) is coupled to the processing device 150, which disables the first audio filter (W_(1,R)) in response to sound from the first remote signal source not satisfying a threshold magnitude. Further, in these embodiments, a second audio detector (of the one or more audio detectors 155) is coupled to the processing device 150, which disables the second audio filter (W_(2,R)) in response to sound from the second remote signal source not satisfying the threshold magnitude. To satisfy the threshold magnitude is to be greater than or at least equal to the threshold magnitude. This threshold magnitude may be set within a certain audio range that statistically determines that the remote signal source is not active, e.g., that the remote talker is not talking or is not talking sufficiently loudly into a remote microphone. In this way, disabling each respective audio filter that is coupled with a particular remote signal source 102 that is inactive helps to improve the performance of that filter and to conserve processing power needed by the processing device 150.
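The sketch below illustrates one way such an audio detector could gate adaptation, using a short-term RMS energy test; the threshold value and the helper names (source_active, lms_update) are hypothetical and stand in for whatever detector and update rule the system actually uses.

```python
import numpy as np

def source_active(x_remote_block, threshold_rms=1e-3):
    """Simple audio detector: report the remote source as active when the
    short-term RMS of its electronic signal satisfies the threshold."""
    rms = np.sqrt(np.mean(x_remote_block ** 2))
    return rms >= threshold_rms

# During adaptation, skip (freeze) the update for inactive sources, e.g.:
#   if source_active(x_r_m_block):
#       w_m = lms_update(w_m, x_r_m_block, error_block)
#   # else: keep w_m unchanged and optionally mute its intermediate output
```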

FIG. 3B is a block diagram of an example set of multi-input, binaural-output (MIBO) audio filters 350B to perform adaptive filtration on multiple electronic signals from remote signal sources, according to an embodiment. As discussed, the MIBO audio filters 350B may be especially suited for arrays of remote signal sources 102 (e.g., a microphone array, a speaker array, or the like) and closely-grouped talkers that may come and go within an area of interest to which a microphone array is pointing.

In at least some embodiments, the MIBO audio filters 350B are similarly labeled W_(m,R) for the first set of audio filters for the right ear and W_(m,L) for the second set of audio filters for the left ear, and thus have similarities to the SIBO audio filters 350A. Different from the SIBO audio filter embodiment (FIG. 3A), however, the first set of audio filters (W_(m,R)) jointly process the one or more remote electronic signals with a single first feedback signal 302A received from the first ear microphone 112 to generate first intermediate output signals 303A. The processing device 150 may combine (e.g., using a summer 305A) the first intermediate output signals 303A to generate the first output signal 340A. Further, the second set of audio filters (W_(m,L)) jointly process the one or more remote electronic signals with a single second feedback signal 302B received from the second ear microphone 122 to generate second intermediate output signals 303B. The processing device 150 may combine (e.g., using a summer 305B) the second intermediate output signals 303B to generate the second output signal 340B.

In some embodiments, the first error signal 301A includes a difference between a combination of the first intermediate output signals 303A and the output (x_(e,R)) of the first ear microphone 112, and a single acoustic loss function of the first set of audio filters is to adaptively minimize the mean-square value of the first error signal, where it is understood that the mean-square value or another function of the error can be minimized. In some embodiments, the second error signal 301B includes a difference between a combination of the second intermediate output signals 303B and the output (x_(e,L)) of the second ear microphone 122, and a single acoustic loss function of the second set of audio filters is to adaptively minimize the mean-square value of the second error signal, where it is understood that the mean-square value or another function of the error can be minimized.

To discuss the MIBO audio filters 350B mathematically, suppose that the desired response is the same for all talkers, that is, $g_n[t]=g[t]$ for all $n$ and $G(\omega)=G(\omega)I$ for all $\omega$. Instead of minimizing the true MSE, the processing device 150 can minimize the loss function

$$\mathcal{L}[t]=\mathbb{E}\left[\left|y[t]-(g * x_e)[t]\right|^2\right]. \qquad (11)$$

In some embodiments, if the signals are wide-sense stationary, then the linear MMSE filter that minimizes $\mathcal{L}$ is given in the frequency domain by

$$W_{\mathrm{MIBO}}=G\left[A_e R_s A_r^H + R_{z_e z_r}\right]\left[A_r R_s A_r^H + R_{z_r}\right]^{-1}. \qquad (12)$$

This MIBO audio filter attempts to replicate both the desired speech and the unwanted noise at the ears, e.g., as delivered within the output signals 340A and 340B. However, if the noise is uncorrelated between the combined propagated audio signals and the remote electronic signals, then $R_{z_e z_r}(\omega)=0$ and the adaptive filter of Equation (12) is identical to the MMSE filter of Equation (7). That is, the MIBO audio filter cannot use the remote electronic signals to predict the noise, only the propagated audio of the talkers (or other signal sources) of interest.

With correlated noise, the spatial cues of the target are distorted by those of the noise, as can be readily seen in the special case where $M=N=1$:

$$W_{\mathrm{MIBO}}A_r=G\,\frac{A_e |A_r|^{2}R_s + R_{z_e z_r}A_r}{|A_r|^{2}R_s + R_{z_r}}. \qquad (13)$$

In the numerator of Equation (13), the noise at the ear microphones distorts the interaural cues to the extent that it is correlated with the noise at the remote microphone (or other signal source). In the denominator of Equation (13), the magnitude of the noise at the remote microphone alters the magnitude of the output, just as it would for the MMSE filter. Thus, system performance may strongly depend on placement of the remote microphone relative to the remote speakers.

In some embodiments, a property of the MIBO audio filters 350B is that the combined adaptive filter processing does not separate the sources of interest, nor does each MIBO audio filter 350B explicitly model their acoustic transfer functions. Since the inputs to each MIBO audio filter 350B can be combinations of the speech signals of interest, the MIBO audio filter 350B is suitable for systems with significant crosstalk, such as wearable microphones on nearby talkers or a microphone array placed near a group of talkers. It can also adapt easily as talkers move around the area near the microphones or as they enter and leave a conversation, where preferably no more than M talkers participate at a time.

In additional embodiments, the processing device imparts additional processing before the output signals 340A and 340B are generated. In at least some embodiments, the processing device 150 further optionally processes (e.g., with other_processing_1R) the combination of the first intermediate output signals 303A to generate the first output signal 340A. Further, the processing device 150 optionally processes (e.g., with other_processing_1L) the combination of the second intermediate output signals 303B to generate the second output signal 340B.

In at least some embodiments, the processing device 150 optionally processes (e.g., with other_processing_2R) the output (x_(e,R)) of the first ear microphone 112 to generate a first processed microphone signal. The processing device 150 can further mix (e.g., via a first summer 310A) the first processed microphone signal into the first output signal 340A. Further, in these embodiments, the processing device 150 optionally processes (e.g., with other_processing_2L) the output (x_(e,L)) of the second ear microphone 122 to generate a second processed microphone signal. The processing device 150 can further mix (e.g., via a second summer 310B) the second processed microphone signal into the second output signal 340B.

In at least some embodiments, the processing device 150 further optionally processes (e.g., with other_processing_3R) the first output signal 340A before outputting the first output signal 340A to the first ear playback device 116. In these embodiments, the processing device 150 further optionally processes (e.g., with other_processing_3L) the second output signal 340B before outputting the second output signal 340B to the second ear playback device 126. In some embodiments, the processing device 150 can apply the various further processing (e.g., designated as "other_processing" herein) separately or in combination, as related to either the SIBO audio filters 350A (FIG. 3A) or the MIBO audio filters 350B (FIG. 3B).

In some embodiments, if the processing device 150 is unsuccessful with source separation of the remote signal sources 102 within a SIBO mode of applying the SIBO audio filters 350A for each ear (FIG. 3A), the processing device 150 switches to a MIBO mode of applying the MIBO audio filter 350B for each ear (FIG. 3B). While the adaptive filtration of the SIBO audio filters 350A and the MIBO audio filter 350B, respectively, is explained above as using the least mean squares algorithm, in other embodiments, a different adaptive algorithm is employed, such as recursive least squares or normalized least mean squares, or others that are known in the field of adaptive filtering. The samples of the audio signals could be processed in blocks, and the learning rate can also be changed over time. Thus, the listening system 100 does not depend on the specific adaptive algorithm used.
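For illustration, a single-channel normalized-LMS step (one of the alternatives mentioned above) might look like the following sketch; the function name and default step size are assumptions, and in the listening system the desired sample would be the (optionally processed) ear-microphone sample.

```python
import numpy as np

def nlms_update(w, x_buf, d, mu=0.5, eps=1e-8):
    """One normalized-LMS step for a single-channel filter.

    w     : (L,) current filter coefficients
    x_buf : (L,) most recent input samples [x[t], x[t-1], ..., x[t-L+1]]
    d     : desired sample (e.g., the ear-microphone sample (g*x_e)[t])
    Returns the updated coefficients and the a-priori error.
    """
    y = np.dot(w, x_buf)                   # filter output at time t
    e = d - y                              # error against the ear reference
    w_new = w + mu * e * x_buf / (np.dot(x_buf, x_buf) + eps)
    return w_new, e
```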

In some embodiments, instead of an arbitrary audio filter or a specific parametric model, the adaptive filtering of an applied set of audio filters could choose the filter that best matches the observed (e.g., processed) audio from a set of possible audio filters. This set of possible audio filters could include, for example, a database of generic human head-related impulse responses, like those used for virtual-reality audio; a database of personalized head-related impulse responses for the listener user, which have been measured directly or inferred based upon the head shape of the listener user; a database of head-related impulse responses augmented by interpolation techniques; or a manifold of head-related impulse responses generated by a manifold learning algorithm. These databases or manifolds could be refined based upon the room acoustics of the ambient environment. For example, the system could select from one set in a strongly reverberant room and another set in a weakly reverberant room.
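As a sketch of such a selection step, the snippet below picks, from a hypothetical database of candidate binaural impulse responses, the one closest to the current adaptive filter estimate in a least-squares sense. The array shapes and the use of a simple Euclidean distance are illustrative assumptions; a real system might compare responses in the frequency domain or weight perceptually important bands.

```python
import numpy as np

def select_best_hrir(candidates, w_adapted):
    """Pick the candidate binaural impulse response closest to the
    adaptively estimated filter in a least-squares sense.

    candidates : (K, 2, L) database of binaural impulse responses
    w_adapted  : (2, L) current adaptive filter estimate for one source
    Returns the index of the best-matching candidate.
    """
    errors = np.sum((candidates - w_adapted[None, :, :]) ** 2, axis=(1, 2))
    return int(np.argmin(errors))
```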

In some embodiments, one or more audio filter(s) applied by the processing device 150 is initialized (and occasionally re-initialized) using either a parametric filter model or a filter chosen from a database or manifold. The chosen filter can then be fine-tuned using an adaptive filtering algorithm such as those discussed herein.

In some embodiments, one or more audio filter(s) applied by the processing device 150 is constrained based on a physical or perceptual model. For example, the adaptive algorithm could impose upper and lower bounds on the magnitude of the frequency response within different bands or on the variation of the magnitude across bands. The audio filters at the two ears can also be constrained so that they do not deviate from each other by more than the expected delay or attenuation due to the head of a listener.
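One simple way to impose such magnitude constraints is sketched below: the filter is transformed to the frequency domain, its magnitude is clamped to per-bin bounds, and its phase is kept. The function name and the bound arrays are illustrative assumptions.

```python
import numpy as np

def clamp_magnitude_response(w, lower, upper):
    """Constrain a filter's magnitude response to lie within per-bin bounds.

    w      : (L,) time-domain filter for one ear and one source
    lower  : (L,) lower bound on |W(omega)| per FFT bin
    upper  : (L,) upper bound on |W(omega)| per FFT bin
    Returns a filter whose magnitude is clamped but whose phase is kept.
    """
    W = np.fft.fft(w)
    mag = np.clip(np.abs(W), lower, upper)
    W_clamped = mag * np.exp(1j * np.angle(W))
    return np.real(np.fft.ifft(W_clamped))
```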

In some embodiments, the adaptation of the one or more audio filter(s) is aided by position information from a head-tracking system or other motion-capture devices. For example, a head-related impulse response can be selected based not only on the audio data, but also on the direction of the talker relative to the head orientation of the listener user.

In some embodiments, the listening system 100 can improve the quality of the first and second output signals 340A and 340B (FIGS. 3A-3B) in a reverberant environment by the processing device 150 performing reverberation-reducing processing, such as truncating the impulse response (or the part of the impulse response following the direct path) to a prescribed length in order to reduce reverberation. When selecting filters from a database or manifold, the processing device 150 can further choose equivalent filters that share the same spatial cues as the propagated audio signal received at the first and/or second ear microphones 112 and 122, but have milder reverberation. The processing device can further adjust gain and reverberation levels based on the distance from the talker to the listener user. The distance can be inferred from the acoustic time of flight of the signal or measured directly using range-finding technology built into the processing device 150. The processing device 150 can further switch off the binaural filter when the talker is far away. Beyond a prescribed distance, the listening system 100 would function like a conventional remote microphone and present the remote signal diotically.
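A minimal sketch of the impulse-response truncation described above is shown below, assuming the direct path corresponds to the largest peak of the response and applying a short cosine fade at the cut to avoid an audible discontinuity. The parameter values are illustrative.

```python
import numpy as np

def truncate_after_direct_path(h, fs, keep_ms=20.0, fade_ms=2.0):
    """Keep the direct path and early part of an impulse response, fading
    out the tail to reduce audible reverberation.

    h       : (L,) impulse response (e.g., one ear of an adapted filter)
    fs      : sample rate in Hz
    keep_ms : portion retained after the direct-path peak, in milliseconds
    fade_ms : length of the cosine fade applied at the cut, in milliseconds
    """
    peak = int(np.argmax(np.abs(h)))              # direct-path arrival
    keep = peak + int(fs * keep_ms / 1000)
    fade = int(fs * fade_ms / 1000)
    out = np.array(h, dtype=float)
    if keep + fade < len(out):
        ramp = 0.5 * (1 + np.cos(np.linspace(0, np.pi, fade)))
        out[keep:keep + fade] *= ramp
        out[keep + fade:] = 0.0
    return out
```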

People with hearing loss often have difficulty hearing people who are not facing them. If the processing device 150 detects that the talker is facing away from the listener, for example based on the slope of the magnitude response of the acoustic transfer function, the processing device 150 can substitute a head-related impulse response for a remote signal source 102 in the same location, but facing toward the listener.

In some embodiments, the listener user could use a control interface displayed within the user interface 160 (e.g., that includes physical knobs, a smartphone app, voice commands, gestures, and the like) to adjust the relative levels of different sounds. The options could include the individual sounds corresponding to remote microphones, the live mixture at the ears, and external signals such as playback from a personal electronic device. In some embodiments, sound levels could be adjusted relative to the magnitude level at the ears, rather than the magnitude level at the source or some absolute measure of sound pressure level. For example, the user could make a conversation partner "twice as loud as real life" or annoying music "half as loud as real life." The listener user could also directly change reverberation levels of each source, or choose upper or lower bounds on acceptable reverberation levels. To provide a more intuitive user experience, instead of providing separate controls for gain and reverberation, the control interface could allow the listener user to change the perceived distance of the remote signal source 102. If the listener user wishes to "move the sound closer," the processing device 150 can increase gain and decrease reverberation by a corresponding amount, for example.

With additional reference to FIG. 1, and in at least one embodiment, the one or more remote signal sources 102 include one or more remote microphones that detect one or more local audio signals and transmit one or more electronic signals corresponding to the one or more local audio signals, as well as one or more remote signal sources that generate one or more additional electronic signals. The one or more remote signal sources can include one or more audio signal transmitters, one or more broadcast devices, one or more sound systems, or a combination thereof. In this embodiment, the first ear microphone 112 detects a first combination of audio signals including ambient sound and propagated audio signals, corresponding to the one or more electronic signals and the one or more additional electronic signals, received at a first ear of a listener. In this embodiment, the second ear microphone 122 detects a second combination of audio signals including ambient sound and propagated audio signals, corresponding to the one or more electronic signals and the one or more additional electronic signals, received at a second ear of the listener that is different than the first ear.

In this at least one embodiment, the processing device 150 applies a first set of audio filters (e.g., W_(m,R)) to generate the first output signal 340A to the first ear playback device 116. The first set of audio filters can include at least a first audio filter (e.g., W_(1,R)) to process a respective electronic signal of the one or more electronic signals with a first error signal, which is based on an output of the first ear microphone 112. At least a second audio filter (e.g., W_(2,R)) is included to process a respective electronic signal of the one or more additional electronic signals with one of the first error signal (for a MIBO audio filter) or a second error signal (for a set of SIBO audio filters), respectively, based on the output of the first ear microphone. In some embodiments, acoustic cue components of the first output signal match corresponding acoustic cue components of the first combination of audio signals.

In at least this embodiment, the processing device applies a second set of audio filters (e.g., W_(m,L)) to generate the second output signal 340B to the second ear playback device 126. The second set of filters can include at least a third audio filter (e.g., W_(1,L)) to process a respective electronic signal of the one or more electronic signals with a third error signal, which is based on an output of the second ear microphone 122. At least a fourth audio filter (e.g., W_(2,L)) is included to process a respective electronic signal of the one or more additional electronic signals with one of the third error signal (for a MIBO audio filter) or a fourth error signal (for a SIBO audio filter) based on the output of the second ear microphone 122. In some embodiments, acoustic cue components of the second output signal 340B match corresponding acoustic cue components of the second combination of audio signals.

With additional reference to FIGS. 3A-3B, the frequency-domain analysis assumes that the audio filters can be non-causal and can have infinite length. In a real listening system, the audio filters are causal and have finite length. Fortunately, because the remote microphones are placed near the talkers, the binaural filters should closely resemble the acoustic impulse responses between the talkers and listener. As long as the group delay of the desired responses (g_(n)) plus any transmission delay between the remote microphones (or other remote signal sources 102) and the first and second ear microphones 112 and 122 is smaller than the acoustic time of flight between talkers and listener, it should be possible to design causal binaural filters.

The above analysis also assumes that the acoustic listening system 100 (or devices) is stationary. In reality, human talkers and listeners move constantly. To adapt to changing conditions, the SIBO and MIBO audio filters can be designed to be time-varying. Let w_(m)[τ; t] ∈ ℝ² be the filter coefficients at time t for m=1, . . . , M and τ=0, . . . , L−1, where L is the length of each filter. The filter output is given by

$\begin{matrix}{{y\lbrack t\rbrack} = {\sum\limits_{m = 1}^{M}{\sum\limits_{\tau = 0}^{L - 1}{{w_{m}\lbrack {\tau;t} \rbrack}{{x_{r,m}\lbrack {t - \tau} \rbrack}.}}}}} & (14)\end{matrix}$

In some embodiments, Equation (14) can be written as a matrix-vector multiplication,

y[t] = w[t] x_(r)[t],  (15)

where x_(r) ^(T)[t] = [x_(r) ^(T)[t], x_(r) ^(T)[t−1], . . . , x_(r) ^(T)[t−L+1]] denotes the stacked vector of current and past remote-signal samples and w[t] ∈ ℝ^(2×LM).

In the experiments in this work, we update the filter coefficients with the least mean squares (LMS) algorithm. The MIBO update is given by

w[t+1] ← w[t] + μ((g ∗ x_(e))[t] − y[t]) x_(r) ^(T)[t],  (16)

where μ is a tunable step size parameter.

The SIBO updates have the same form except that each audio filter is adapted independently:

w_(m)[t+1] ← w_(m)[t] + μ((g_(m) ∗ x_(e))[t] − y_(m)[t]) x_(r,m) ^(T)[t],  (17)

where y_(m)[t] is the output of filter m alone.
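The adaptive binaural filtering of Equations (14)-(17) can be illustrated with a short sketch. The following is a minimal NumPy example, not the claimed implementation: the array shapes, step size, toy signals standing in for (g ∗ x_e)[t], and helper names are assumptions made for illustration.

```python
# Minimal NumPy sketch of the time-varying binaural filter of Eq. (14) and
# the MIBO/SIBO LMS updates of Eqs. (16)-(17). Shapes, step size, toy
# signals, and helper names are illustrative assumptions.
import numpy as np

M, L = 3, 64            # number of remote signals, filter length in taps
mu = 0.01               # tunable LMS step size

def filter_output(w, x_buf):
    """Eq. (14): y[t] = sum_m sum_tau w_m[tau; t] x_{r,m}[t - tau]."""
    # w: (2, M, L) left/right filters; x_buf[m, tau] = x_{r,m}[t - tau]
    return np.einsum('cml,ml->c', w, x_buf)

def mibo_lms_step(w, x_buf, d):
    """Eq. (16): one joint update driven by the binaural error d - y,
    where d stands in for the desired reference (g * x_e)[t]."""
    y = filter_output(w, x_buf)
    e = d - y
    w += mu * e[:, None, None] * x_buf[None, :, :]
    return w, y

def sibo_lms_step(w, x_buf, d_per_source):
    """Eq. (17): each source's filter adapts on its own error signal,
    using only that filter's output y_m. d_per_source has shape (M, 2)."""
    for m in range(M):
        y_m = w[:, m, :] @ x_buf[m]
        e_m = d_per_source[m] - y_m
        w[:, m, :] += mu * e_m[:, None] * x_buf[m][None, :]
    return w

# Toy run with white remote signals and a fabricated binaural reference.
rng = np.random.default_rng(0)
w = np.zeros((2, M, L))
x_hist = np.zeros((M, L))
for t in range(1000):
    x_hist = np.roll(x_hist, 1, axis=1)
    x_hist[:, 0] = rng.standard_normal(M)
    d = rng.standard_normal(2)            # stand-in for (g * x_e)[t]
    w, y = mibo_lms_step(w, x_hist, d)
```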

With additional reference to FIG. 1 , other possible remote signal sources 102 are envisioned, alone or as combined with other remote signal sources, capable of generating electronic signals that represent sound that is also being passed as propagated audio signals through the air. In some embodiments, the remote source signals could be any low-noise mixture of the talkers of interest. For example, the output of a source separation or enhancement algorithm (such as independent vector analysis or a deep neural network) could be connected to the input of the MIBO audio filter 350B. The advantage of the proposed approach is that the input to the adaptive filters can be any combination of the sources of interest. Thus, a source separation algorithm could be useful even if it suffers from a permutation ambiguity, that is, if there is crosstalk in its output.

As another example, the outputs of a set of beamformers, such as thoseused in many commercial teleconferencing audio capture systems, could beused as inputs to the MIBO audio filter 350B. The adaptive filter wouldadd utility to the listening system 100 by restoring spatial cues andcompensating for any spectral distortion caused by the beamformer.Furthermore, talkers would be able to move between the beams and theMIBO audio filter 350B would continue to produce the correct spatialcues without extensive adaptation, as the MIBO audio filter 350B adaptsbased on the beams from beamformer microphones, not the talkerpositions.

The listening system 100, which employs the disclosed adaptive filtering, was evaluated experimentally using a binaural dummy head in an acoustically treated laboratory (T₆₀≈250 ms). Speech signals were either produced by a human talker or derived from the VCTK dataset and played back over loudspeakers. Each talker was recorded separately and the recordings were mixed to simulate simultaneous speech. For each experiment, the adaptive filter coefficients were computed based on the mixture but applied separately to each source recording in order to track the effect of the system on each component signal. The filters were about 20 ms in length and were designed to be transparent for the source(s) of interest (g_(n)[t]=δ[t]). The step size μ was tuned manually. For each experiment, the wideband signal-to-noise ratio (SNR) was computed after high-pass filtering at 200 Hz to exclude mechanical noise in the laboratory. The apparent interaural time delays (ITD) were computed by finding the peak of the cross-correlation within overlapping 5 second windows.
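As one illustration of the ITD measurement described above (the peak of the cross-correlation within overlapping 5 second windows), a sketch along the following lines could be used; the hop size, lag limit, and SciPy routines are assumptions rather than the exact analysis code.

```python
# Sketch of the ITD measurement: the apparent interaural time delay is the
# lag of the cross-correlation peak within overlapping 5 second windows.
import numpy as np
from scipy.signal import correlate, correlation_lags

def apparent_itd(left, right, fs, win_s=5.0, hop_s=2.5, max_itd_s=1e-3):
    win, hop = int(win_s * fs), int(hop_s * fs)
    max_lag = int(max_itd_s * fs)              # restrict to plausible ITDs
    itds = []
    for start in range(0, len(left) - win + 1, hop):
        l = left[start:start + win]
        r = right[start:start + win]
        xc = correlate(l, r, mode='full')
        lags = correlation_lags(len(l), len(r), mode='full')
        keep = np.abs(lags) <= max_lag
        peak_lag = lags[keep][np.argmax(xc[keep])]
        itds.append(peak_lag / fs)             # seconds; sign gives direction
    return np.array(itds)
```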

The experiments are summarized in FIGS. 4A-4B and Table 1. FIG. 4A is ablock diagram of an experimental setup with a moving human talker withmultiple signal sources and a non-moving listener according to anembodiment. FIG. 4B is a block diagram of the experimental setup withthree loudspeaker signal sources and a moving listener according to anembodiment. Table 1 illustrates wideband SNR in decibels for acousticexperiments. Input and filter output SNRs are measured at the left earfor experimental purposes.

TABLE 1

Sources     Listener    RMs     Input    Remote    MIBO    SIBO
1 moving    Still       Lapel   −4.3      9.3       —       7.2
3 still     Moving      Near     0.8     21.3      18.8    18.2
3 still     Moving      Far      0.7     12.0       9.8    11.7

FIG. 5 is a set of graphs illustrating filter performance for a singlemoving talker according to an embodiment. The top graph illustrates SNRat the left ear. The bottom graph illustrates apparent ITD of the targetsource in the filter output. The dotted curve shows the true ITD. In thefirst experiment, which simulates the typical use case for remotemicrophone systems today, a lapel microphone was worn by a moving humantalker. Noise was produced by seven loudspeakers placed around the room.The human subject followed the same route during each source recordingso that sound and motion are roughly synchronized. The top plot of FIG.5 shows the wideband input and output SNR at the left ear and the inputSNR at the remote microphone. The SNR varied as the talker moved amongthe interfering loudspeakers. The output SNR closely tracks the remotemicrophone input SNR, as expected. The bottom plot shows the apparentITD of the target speech at the output of the binaural filter comparedto that of the clean signal at the ears. The adaptive filter is able totrack the spatial cues as the talker moves from center to left to rightand back again. Thus, the filter output matches the SNR of the remotemicrophone and the spatial cues of the earpieces.

FIGS. 6A-6D are a set of graphs illustrating apparent interaural time delays (ITDs) from either near signal sources or far signal sources, varied between the filters of FIG. 4A and FIG. 4B, according to various embodiments. A second experiment simulated a multiple-talker application with a moving listener. The dummy head was placed on a motorized turntable, which made one rotation during the one minute recording, starting from the FIG. 4B scenario. Loudspeakers simulated three talkers of interest and five unwanted speech sources. The remote microphones were three end-address cardioid vocal microphones. First, to simulate personal remote microphones, each remote microphone was placed about 30 cm in front of its corresponding speaker. Second, to simulate an array, the three remote microphones were grouped together about 60 cm from the talkers.

The SNR results are shown in Table 1 and the apparent ITDs are shown inFIGS. 6A-6B for the four combinations of filter type and microphoneplacement. When the RMs were close to the talkers, the SIBO filters andMIBO filter both performed well, with the MIBO filter achieving aslightly higher SNR and better preserving interaural cues. When theremote microphones were farther from the talkers, the MIBO filter stillpreserved interaural cues but also reproduced more unwanted noise. TheSIBO filters were better at rejecting noise, but crosstalk betweensources caused distortion of the interaural cues.

FIG. 7 is a block diagram illustrating an exemplary listening system 700 (or electronic assembly) involving remote microphones that are co-located in an area and associated with a group conversation according to various embodiments. In these embodiments, the listening system 700 includes several listening devices such as the listening devices 110, 120 already discussed (see FIG. 1 ), e.g., hearing aids, cochlear implants, or ear buds of different kinds, as well as microphone devices 720, e.g., mobile devices such as smartphones, tablets, and other mobile computing devices that include an integrated microphone. In this sense, the listening devices 110, 120 may also be considered to be microphone devices 720. For example, a microphone device may be an in-ear microphone integrated within an ear playback device or may be a microphone integrated within a mobile device of the user.

In these embodiments, the microphone devices 720 may include a first microphone device 720A owned (e.g., carried) by a first user, a second microphone device 720B owned (e.g., carried) by a second user, and a third microphone device 720C owned (e.g., carried) by a third user that are co-located in an area and that generate a first electronic signal, a second electronic signal, and a third electronic signal, respectively. Thus, to be “co-located” connotes being within audio detection range for speech, and these electronic signals correspond to sound within such audio detection range.

In at least some embodiments, each of these microphone devices 720(illustrated in detail with respect to the third microphone device 720Cfor purposes of explanation) includes a microphone 722, a playbackdevice 726 (e.g., speaker, paired speaker, paired ear buds, paired earphones, or the like), a processing device 750, an audio detector 755that includes control logic 757, and a communication interface 770. Insome embodiments, as discussed with reference to FIG. 1 , at least aportion of the processing device 750 may be incorporated within one ofthe listening devices 110, 120. Thus, in at least one embodiment, thefirst microphone device 720A (or other of the disclosed microphonedevices) is an audio detection system that includes at least a portionof the control logic 757 (implementing audio detecting) and at least aportion of the processing device 750. While only three users and theirrespective microphone devices 720A, 720B, 720C are illustrated, thisdisclosure contemplates N users and N microphone devices as will bediscussed in more detail.

In various embodiments, the pairing of a playback device 726 or one of the listening devices 110, 120 to one or more of the microphone devices 720 may be performed over a respective communication interface 770 of a microphone device over a network 715. In different embodiments, the network 715 is a personal area network (PAN), a body area network (BAN), or a local area network (LAN). The technology used for such pairing and communication over the network 715 may include technologies such as Bluetooth®, Near-Field Communication (NFC), Wi-Fi®, Zigbee®, or a similar protocol that enables generation of a personal area network (PAN) 715. It is envisioned that a future network 715 will be configured to handle communication at the speed of sound to facilitate the audio-based intercommunication of the microphone devices 720 and the listening devices 110, 120.

In various embodiments, the audio detector 755 may be a voice activity detector that is coupled to the microphone 722 (or other voice detection hardware) and is able to detect speech from the different users as an audio signal. An audio signal is to be distinguished herein from noise in an ambient environment of the users. Audio signals may be combined (e.g., mixed) with such noise to generate the first, second, and third electronic signals from the three users of FIG. 7 . In some embodiments, one or more of the first microphone device 720A, the second microphone device 720B, and the third microphone device 720C are instantiated within a single audio detection device (e.g., an audio puck with an array of microphones pointed in different directions) that uses beamforming to detect audio signals and crosstalk audio signals from multiple sources. Other microphone arrays are envisioned as well.

Voice activity detection (or VAD) is a technique in which presence orabsence of human speech is detected, e.g., identifying or classifyingaudio as human speech as opposed to other ambient sounds or noise.Although VAD is commonly performed using logic (e.g., the control logic757) coupled to a microphone, VAD may also be employed in conjunctionwith an accelerometer (such as present in AirPods®) or a vibrationdetection device attached to the throat area of a user. Thus, a VADdevice may be employed as a part of an intelligent or smart audiodetection device or system, which can be embedded within any singledevice or a combination of ear listening devices 110, 120 or microphonedevices 720 discussed herein in various embodiments.

In embodiments, VAD is used to trigger one or more processes, as will be discussed in more detail, performed by the processing device 750. For example, VAD can be applied in speech-controlled applications and devices like smartphones (among other smart devices employed in homes, offices, and vehicles), which can be operated by using speech commands. Further, some of the main uses of VAD are in speaker diarization, speech coding, and speech recognition. VAD can facilitate speech processing and can be used to deactivate some processes during non-speech sections of an audio session, or during a speech section of the audio session when in a group conversation environment, as will be discussed.

Oftentimes, the intelligibility of group conversations in noisyenvironments such as a restaurant, networking meeting, or the like ispoor. In embodiments, the listening system 700 is employed to improvethe intelligibility by aggregating signals from the mobile and wearabledevices, referred to herein as the microphone devices 720 and listeningdevices 110, 120, of the participants (e.g., Users 1-3). In disclosedembodiments, the listening system 700 uses a microphone device 720placed near each talker to capture a low-noise speech signal. Instead ofmuting inactive microphones, which can be distracting and lead to theloss of some speech at transitions from muting, the processing device750 can employ adaptive crosstalk cancellation filters to remove thespeech of other users, including delayed auditory feedback of thelistener's own speech. Next, the processing device 750 can employadaptive spatialization filters that process the low-noise signals togenerate binaural outputs that match the spatial and spectral cues atthe ears of each listener. These adaptive spatialization filters werediscussed in detail with reference to FIGS. 1-6D.

Conventional listening devices, such as hearing aids, work poorly in noisy environments because their microphones have the same SNR as the unaided ears. However, a network of several microphone-equipped devices spread around the group could achieve greater spatial diversity, providing better noise reduction performance than any single device. Herein is proposed a group conversation enhancement system according to various embodiments that aggregates signals from the mobile and wearable devices of conversation participants. Wireless sensor networks and distributed microphone arrays have been proposed for spatial sound acquisition. For example, mobile phones near talkers can help fixed microphone arrays transcribe a meeting. A distributed beamforming algorithm for nonmoving hearing aid networks can also be employed. Real-world human listening enhancement systems pose additional challenges: the system is to operate in real time with imperceptible delay, generally several milliseconds; the system is to preserve the spatial cues that humans use to localize and separate sounds, such as interaural time and level differences; and the system is to contend with continuous motion of both sound sources and microphones.

In embodiments, modern listening devices are paired with a wirelessremote microphone (RM) accessory that transmits low-noise speechdirectly from a talker to the ears of a listener and low-latencywireless standards may soon allow smartphones to act as convenient RMs.Well-placed RMs can greatly improve intelligibility of a single distanttalker in noise, but current systems are unsuitable for groupconversations because they support only one talker at a time and do notpreserve interaural cues. Some researchers have proposed applyingspatialization filters to RM signals based on the estimated direction ofarrival. As discussed with reference to FIGS. 1-6D, earpiece microphonesare used as reference signals for an adaptive filter, eliminating theneed for explicit source localization. This approach may also beemployed in binaural beamforming systems, either using earpieces aloneor in combination with external microphones.

Starting with FIG. 7 , this disclosure extends the adaptivespatialization techniques of FIGS. 1-6D to address the challenges ofclose group conversations. Because the devices are closely spaced, theremay exist significant crosstalk between microphones, which can causedistortion of spatial cues and delayed auditory feedback of thelistener's own speech, which can be disturbing and impede speechproduction. A common solution to crosstalk and own-speech echo is todisable all but one microphone at a time. However, frequent muting andunmuting of microphones can be distracting in a fast-paced groupconversation and, if there is delay in the voice activity detection(VAD), can cause listeners to miss the first few syllables from a talkerthat was previously silent. Instead, the listening system 700 may beconfigured with crosstalk cancellation filters to suppress echoes, e.g.crosstalk. The listening system 700 provides a more natural listeningexperience in group conversations that may include frequentinterruptions and double-talk.

A further challenge in group conversations is that users moveconstantly, causing acoustic channel parameters to change during andbetween utterances. In embodiments, therefore, the processing device 750continuously updates the adaptive filters while in use. In this work,stationary mobile devices are used as the remote signal sources becausetheir acoustic channel parameters are more stable than those of wearablemicrophones, allowing the adaptive filters to converge more quickly asusers move. Meanwhile, earpieces and other wearable devices that movewith the users are helpful for VAD and as references for trackinginteraural cues.

Consider a group of N≥2 talkers and N remote microphones, as shown in FIG. 1 , numbered such that RM n is placed near talker n for n=1, . . . , N. Let s_(n)[t] be the discrete-time speech signal from talker n as captured by RM n. Consider a short time interval during which the acoustic channels from talkers to microphones can be considered time-invariant. Let a_(r,m,n)[τ] be the relative impulse response (RIR) describing the acoustic channel from talker n to RM m relative to RM n and let z_(r,m)[t] be the ambient noise at RM m. Then the mixture x_(r,m)[t] captured by RM m is given by

x_(r,m)[t] = Σ_(n=1) ^(N) (a_(r,m,n) ∗ s_(n))[t] + z_(r,m)[t],  m=1, . . . , N,  (18)

where ∗ denotes linear convolution. Note that because each s_(n) is defined with respect to RM n, each a_(r,n,n)[τ] is the unit impulse δ[τ]. If each RM is placed close to its corresponding talker, then the RIRs of the other microphones should be well modeled by causal filters executed by the processing device 750.
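For illustration, the convolutive mixture model of Equation (18) can be simulated along the following lines; the RIR array layout, lengths, noise level, and function name are assumptions made only for this sketch.

```python
# Sketch of the convolutive mixture model of Eq. (18): each RM captures its
# own talker plus convolved crosstalk from the others plus ambient noise.
import numpy as np
from scipy.signal import fftconvolve

def rm_mixtures(s, a, noise_std=0.01, rng=None):
    """s: (N, T) talker signals as captured at their own RMs.
       a: (N, N, K) relative impulse responses, a[m, n] from talker n to RM m,
          with a[n, n] equal to a unit impulse by definition."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, T = s.shape
    x = np.zeros((N, T))
    for m in range(N):
        for n in range(N):
            x[m] += fftconvolve(s[n], a[m, n])[:T]
        x[m] += noise_std * rng.standard_normal(T)   # ambient noise z_{r,m}
    return x
```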

In addition to the remote microphones, in embodiments, each user wears a binaural listening device containing a left microphone (e.g., microphone 122) and a right microphone (e.g., microphone 112). Let a_(e,m,n)[τ]=[a_(e,m,n) ^(left)[τ], a_(e,m,n) ^(right)[τ]]^(T) be the vector of RIRs from talker n to the left and right ears of listener m for m,n=1, . . . , N and let z_(e,m)[t] ∈ ℝ² be the ambient noise at those earpiece microphones 112, 122. Then the mixture x_(e,m)[t]=[x_(e,m) ^(left)[t], x_(e,m) ^(right)[t]]^(T) captured by the earpieces of listener m is given by

$\begin{matrix}{{x_{e,m}\lbrack t\rbrack} = {\sum\limits_{n = 1}^{N}{( {a_{e,m,n}\bigstar s_{n}} )\lbrack t\rbrack}} + {z_{e,m}\lbrack t\rbrack},\;{m = 1},\ldots,{N.}} & (19)\end{matrix}$

In at least some embodiments, the listening system 700 performs conversation enhancement by removing ambient noise and own-speech echoes while preserving the speech of other talkers with correct spatial cues. The removal of ambient noise, for example, may be performed by removing the own-speech echoes together with the ambient noise mixed into those own-speech echoes, providing a significant improvement in the clarity of the desired audio signal being received from other microphone devices 720 (of other talkers). The desired output y_(m)[t]=[y_(m) ^(left)[t], y_(m) ^(right)[t]]^(T) for listener m is given by

$\begin{matrix}{{y_{m}\lbrack t\rbrack} = {\sum\limits_{n \neq m}{( {a_{e,m,n}\bigstar s_{n}} )\lbrack t\rbrack}},\;{m = 1},\ldots,{N.}} & (20)\end{matrix}$

This binaural output may be amplified, equalized, compressed, orotherwise processed before it is presented to the listener, which mayinclude the spatialization filtering discussed herein. In someembodiments, the enhanced signals y_(m) are mixed with the earpiecesignals x_(e,m) to better preserve situational awareness. Because thespatialized signals will be mixed with live signals—eitherelectronically within the device or acoustically in the ear—thepost-cancellation processing endeavors to generate an output withnear-zero delay relative to the live signal at the corresponding ear.

FIG. 8 is a simplified block diagram of an example of crosstalk cancellation as between two remote microphones associated with two of the users illustrated in FIG. 7 according to some embodiments. FIG. 10 is a simplified block diagram of an example of crosstalk cancellation as between three remote microphones associated with three of the users illustrated in FIG. 7 according to some embodiments. In various embodiments, one of the processing devices 750 (e.g., of user m) performs the processing as illustrated in FIG. 8 (for User1 and User2) and as illustrated in FIG. 10 (for all three users) from the perspective of User2. Specifically, the processed audio in both FIG. 8 and FIG. 10 is delivered to playback device(s) of User2, who owns the second microphone device 720B. (FIG. 8 and FIG. 10 will be discussed in more detail later.) In embodiments, this processing is partitioned into two main stages. The first main stage is crosstalk cancellation to improve separation and suppress echoes of the listener's own speech. The second main stage is spatialization to preserve realistic spatial and acoustic cues, which was discussed more thoroughly with reference to FIGS. 1-6D.

In various embodiments, both crosstalk suppression and spatialization rely on accurate VAD to determine or identify which users are speaking. Wearable devices are attractive for VAD because they are physically attached to users. Earpieces can use hardware features such as bone-conduction microphones to perform reliable VAD even in strong noise. In experiments, two wearable VAD implementations were compared, including a more-reliable VAD using headset microphones and a less-reliable VAD using lapel microphones. Speech was detected using a multivariate Gaussian likelihood ratio test in the short-time Fourier transform domain. Second-order statistics were estimated using training data and time-frequency log-likelihood ratios were averaged from 0 to 1 kHz in half-second time windows. The resulting statistics were compared against a manually-tuned threshold. The headset and lapel VADs were 90% and 82% accurate, respectively, in a one-at-a-time conversation with moving talkers.
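A simplified sketch of such an STFT-domain likelihood-ratio VAD is shown below; the per-bin complex-Gaussian model, the placeholder variance estimates, and the threshold handling are assumptions, and the experimental implementation may differ.

```python
# Simplified sketch of an STFT-domain likelihood-ratio VAD: per-bin
# complex-Gaussian log-likelihood ratios (speech-plus-noise vs. noise only)
# are averaged from 0 to 1 kHz over half-second blocks and thresholded.
# The variance estimates and threshold are placeholders that would come
# from training data in practice.
import numpy as np
from scipy.signal import stft

def vad_decisions(x, fs, var_noise, var_speech, threshold,
                  nperseg=512, block_s=0.5, fmax=1000.0):
    f, t, X = stft(x, fs=fs, nperseg=nperseg)
    band = f <= fmax
    power = np.abs(X[band]) ** 2
    # Log-likelihood ratio under zero-mean complex Gaussian models;
    # var_speech is the speech-plus-noise variance per frequency bin.
    llr = (power * (1.0 / var_noise[band, None] - 1.0 / var_speech[band, None])
           + np.log(var_noise[band, None] / var_speech[band, None]))
    llr_per_frame = llr.mean(axis=0)
    hop_s = (nperseg // 2) / fs                     # default STFT hop
    frames_per_block = max(1, int(round(block_s / hop_s)))
    n_blocks = len(llr_per_frame) // frames_per_block
    blocks = llr_per_frame[:n_blocks * frames_per_block].reshape(n_blocks, -1)
    return blocks.mean(axis=1) > threshold          # one decision per block
```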

In a group conversation, the talkers are close together so that eachmicrophone of each microphone device 720 captures speech from all users.Instead of muting the microphones of users who are not speaking, whichcould be distracting and cause listeners to miss parts of theconversation, the processing device 750 is configured to keepmicrophones on at all times, but uses adaptive cancellation filters toremove crosstalk. The processed microphone signals ŝ_(n)[t] are given by

$\begin{matrix}{{\hat{s}}_{n}\lbrack t\rbrack = \begin{cases}{x_{r,n}\lbrack t\rbrack,} & {\text{if user } n \text{ is talking},} \\ {{x_{r,n}\lbrack t\rbrack} - {\sum_{m \neq n}{( {u_{n,m}\bigstar x_{r,m}} )\lbrack t\rbrack}},} & {\text{otherwise},}\end{cases}} & (21)\end{matrix}$

for n=1, . . . , N, where each u_(n,m) is a finite-impulse-response filter. In a low-noise environment, each u_(n,m) models the corresponding RIR a_(r,n,m). In embodiments, the filter is disabled when user n is speaking to prevent target signal cancellation; merely pausing adaptation was found to be ineffective, presumably due to motion. Note that the filter cancelling source m at microphone n remains active even when user m is quiet in order to avoid echoes in case of false negatives from the VAD. When user m is quiet, the filter will help to suppress noise from the direction of that user.
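A minimal sketch of the VAD-gated cancellation of Equation (21) might look as follows, assuming frame-based processing with fixed filters within a frame; the array shapes and function name are illustrative.

```python
# Sketch of the VAD-gated crosstalk cancellation of Eq. (21): when user n is
# talking the raw RM signal passes through; otherwise the filtered crosstalk
# estimates from the other RMs are subtracted.
import numpy as np
from scipy.signal import lfilter

def cancel_crosstalk(x_r, u, talking):
    """x_r:     (N, T) remote-microphone signals for one frame.
       u:       (N, N, K) cancellation filters, u[n, m] applied to x_{r,m}.
       talking: (N,) boolean VAD decisions for the frame."""
    N, T = x_r.shape
    s_hat = x_r.copy()
    for n in range(N):
        if talking[n]:
            continue                    # filter disabled while user n speaks
        for m in range(N):
            if m != n:
                s_hat[n] -= lfilter(u[n, m], [1.0], x_r[m])
    return s_hat
```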

Because human talkers move frequently, the crosstalk cancellationfilters are updated continuously when the crosstalk cancellation filtersare active. When user n is quiet, ŝ_(n)[t] is a linear prediction errorsignal with x_(r,n)[t] as the reference signal. The filters are adaptedto perform crosstalk cancellation optimization as

$\begin{matrix}{\min\limits_{{\{ u_{n,m}\}}_{m \neq n}}{{\mathbb{E}}\lbrack {\lvert{{\hat{s}}_{n}\lbrack t\rbrack}\rvert}^{2} \rbrack},} & (22)\end{matrix}$

where 𝔼 denotes statistical expectation. In experiments, Equation (22) was iteratively solved using the normalized least-mean-squares (NLMS) algorithm with first-order prewhitening.
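One possible per-sample form of such an NLMS update is sketched below; the fixed pre-emphasis coefficient used here as a stand-in for the first-order prewhitening, the step size, and the regularization constant are assumptions.

```python
# Sketch of a per-sample NLMS update for the cancellation filters that
# minimize Eq. (22). A fixed pre-emphasis (1 - rho z^-1) on the reference
# and error stands in for the first-order prewhitening.
import numpy as np

def nlms_prewhitened_step(u_nm, x_buf, x_oldest_prev, e, e_prev,
                          rho=0.9, mu=0.5, eps=1e-8):
    """u_nm:          (K,) cancellation filter for one (n, m) pair.
       x_buf:         (K,) reference samples x_{r,m}[t], ..., x_{r,m}[t-K+1].
       x_oldest_prev: sample x_{r,m}[t-K], needed to prewhiten the buffer.
       e, e_prev:     current and previous prediction errors s_hat_n."""
    x_shift = np.concatenate((x_buf[1:], [x_oldest_prev]))
    x_w = x_buf - rho * x_shift           # prewhitened reference vector
    e_w = e - rho * e_prev                # prewhitened error sample
    u_nm = u_nm + (mu / (x_w @ x_w + eps)) * e_w * x_w
    return u_nm
```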

It is instructive to compare the behavior of the cancellation system tothat of a muting system with an imperfect VAD. Consider N=2 users andzero ambient noise. When User1 is speaking and User2 is quiet, thecancellation filter converges to the Wiener solutionu_(2,1)[τ]=a_(2,1)[τ] so that User1 is perfectly cancelled and ŝ₂ [t]=0,just as in a muting system. Suppose that User2 interrupts and the VAD ofUser2 does not immediately detect the interruption. In the mutingsystem, speech of User2 would be inaudible. In the listening system 700,the output immediately following the interruption can be expressed as:

ŝ₂[t] = x_(r,2)[t] − (u_(2,1) ∗ x_(r,1))[t]  (23)

ŝ₂[t] = ((a_(2,1) − a_(2,1)) ∗ s₁)[t] + ((δ − a_(2,1) ∗ a_(1,2)) ∗ s₂)[t]  (24)

ŝ₂[t] = ((δ − a_(2,1) ∗ a_(1,2)) ∗ s₂)[t]  (25)

The speech from User1 is still cancelled correctly and the speech from User2 is audible but distorted. The severity of the distortion depends on the crosstalk channels between microphones. With well-positioned directional microphones, the RIRs a_(2,1) and a_(1,2) should both have magnitude responses much smaller than one (“1”) so that the distortion has little effect on s₂. In a system with strong crosstalk, such as a compact microphone array, the listening system 700 may cause distortion, in which case a linearly constrained beamformer may be more appropriate.

With additional reference to FIG. 8 , the control logic 757 (e.g., ofthe audio detector 755 of the first microphone device 720A) performsvoice activity detection, including to detect no first audio signal fromthe first microphone device 720A and detect a crosstalk audio signalfrom a direction of the second microphone device 720B that matches thesecond electronic signal (x_(r,2)). The lack of audio signal can meanthat User1 is quiet (the VAD does not detect speech), so there still maybe noise detected by the first microphone device 720A. The term“matches” here refers to being substantially the same audio signal,except for some differences in associated noise and delay. Thus, in atleast some embodiments, the first electronic signal (x_(r,1)) includes amixture that includes something closely resembling the crosstalk audiosignal (e.g., x_(r,2)) and any ambient noise that is detected.

In these embodiments, an ear playback device (e.g., receiver or speaker)such as the ear playback device 116 of the first ear listening device110 or the ear playback device 126 of the second ear listening device120 is associated (e.g., paired) with the second microphone device 720B.In these embodiments, the processing device 750 (e.g., of the secondmicrophone device 720B) is communicatively coupled to the first andsecond microphone devices 720A and 720B, to the control logic 757, andto the ear playback device, e.g., via the network 715. In at least oneembodiment, the processing device 750 receives the first electronicsignal and the second electronic signal and performs crosstalkcancellation (e.g., the application of crosstalk filters 802) to removethe second electronic signal from the first electronic signal togenerate a cleansed first electronic signal (ŝ₁). In embodiments, toremove the second electronic signal, the processing device 750 appliesan adaptive cancellation filter to the first electronic signal withrespect to the second electronic signal, for example.

As was mentioned in the N-user embodiments, the processing device 750disables the cancellation filter when user n is speaking to preventtarget signal cancellation. Applying this concept to the specifictwo-user example of FIG. 8 , recall that unlike in many multi-talkersystems that only activate one microphone at a time, the processingdevice 750 of the listening system 700 is configured to leave themicrophones on even when their respective talkers are quiet (e.g., VADdetects no speech). When a talker is quiet, the microphone of the talkerruns the crosstalk cancellation algorithm instead of shutting off. Whenthe talker starts talking, the crosstalk cancellation is disabled.

Thus, by way of example in FIG. 8 , according to at least some embodiments, when User2 is talking and User1 is quiet, the illustrated crosstalk cancellation is active. This allows User2 to talk without hearing annoying crosstalk (e.g., which sounds like an echo), but still allows User1 to interrupt at any time without waiting for the microphone to reactivate. When User1 is talking and User2 is quiet, the illustrated crosstalk cancellation is disabled, as unnecessary. Thus, in this example, in response to the control logic 757 detecting the first audio signal x_(r,1) indicative of speech from the first user (User1), the processing device 750 disables the adaptive cancellation filter according to an embodiment. In this situation, the crosstalk cancellation may be performed on behalf of User1 instead of User2 if User1 is also a listener in the listening system 700 (but this cancellation is not illustrated). When both users are talking at the same time, crosstalk cancellation is also off, so User2 may hear some unwanted own-voice echo (if loud enough to reach the microphone of User1), but this is unavoidable when disabling the crosstalk cancellation in this scenario. The spatialization filtering can, in these situations, provide additional contextual processing to improve the received audio, despite that User2 might hear some crosstalk.

In some embodiments, the processing device 750 further processes the cleansed first electronic signal ŝ₁ to integrate the cleansed first electronic signal into an output signal (ŷ_(2,R) and/or ŷ_(2,L)) to the ear playback device 116 or 126, e.g., to a receiver of the first ear listening device 110 and/or the second ear listening device 120, respectively. In embodiments, this further processing includes applying spatialization filters 804A (e.g., at least a first audio filter of a set of audio filters) to the cleansed first electronic signal ŝ₁ with a first error signal, which is based on an output x_(e2,R) of the first ear microphone 112 of the first ear listening device 110, to generate the output signal to the first ear playback device 116. In embodiments, this further processing optionally also includes applying spatialization filters 804B (e.g., at least a second audio filter of the set of audio filters) to the cleansed first electronic signal ŝ₁ with a second error signal, which is based on an output x_(e2,L) of the second ear microphone 122 of the second ear listening device 120, to generate the output signal to the second ear playback device 126.

In various embodiments, with additional specificity, the spatializationfilters process the low-noise source estimates ŝ₁[t], . . . ,ŝ_(N)[t] tomatch the spatial and acoustic cues at the ears of each listener,including interaural time and level differences, spectral shaping, andearly reflections. The binaural output mixture for listener m is givenby

$\begin{matrix}{{{\hat{y}}_{m}\lbrack t\rbrack} = {\sum\limits_{n \neq m}{( {w_{m,n}\bigstar{\hat{s}}_{n}} )\lbrack t\rbrack}},} & (26)\end{matrix}$

where each w_(m,n)[τ] ∈ ℝ² is a causal finite-impulse-response filter. The filters for each listener m are updated to solve

$\begin{matrix}{\min\limits_{{\{ w_{m,n}\}}_{n \neq m}}{{\mathbb{E}}\lbrack {\lvert{{x_{e,m}\lbrack t\rbrack} - {{\hat{y}}_{m}\lbrack t\rbrack}}\rvert}^{2} \rbrack}.} & (27)\end{matrix}$

In conducted experiments, this cost function is minimized iterativelyusing the NLMS algorithm with first-order prewhitening. Unlike thecrosstalk cancellation filters, the spatialization filters are alwaysactive, even when their respective users are not speaking. However, eachw_(m,n) is updated only while user n is speaking. If the filters wereupdated continuously, then the filters w_(m,n) would amplify nearbynoise sources during speech pauses.
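A sketch of this selective update rule is given below, under the same kinds of illustrative assumptions as the earlier examples (shapes, step size, regularization); for listener m, the own-speech source n=m is assumed to be excluded from the arrays passed in.

```python
# Sketch of the selectively updated spatialization filters of Eqs. (26)-(27):
# every filter always contributes to the binaural output, but only the filters
# of currently active talkers take an NLMS step toward the earpiece reference.
import numpy as np

def spatialize_step(w, s_hat_buf, x_ear, talking, mu=0.5, eps=1e-8):
    """w:         (N, 2, K) filters, w[n] maps cleansed source n to left/right.
       s_hat_buf: (N, K) recent cleansed source samples, newest first.
       x_ear:     (2,) current left/right earpiece samples (reference).
       talking:   (N,) boolean VAD decisions."""
    y = np.einsum('nck,nk->c', w, s_hat_buf)       # Eq. (26): binaural output
    e = x_ear - y                                  # error used in Eq. (27)
    for n in np.flatnonzero(talking):              # update active talkers only
        norm = s_hat_buf[n] @ s_hat_buf[n] + eps
        w[n] += (mu / norm) * np.outer(e, s_hat_buf[n])
    return w, y
```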

When multiple users are speaking simultaneously, the spatializationfilter coefficients are updated jointly. They therefore act as amultiple-input, binaural-output (MIBO) filter that maps from inputmixtures to output mixtures. It was shown with reference to FIG. 3B thatan N-input MIBO filter can preserve the spatial cues of up to N sources.MIBO filters do not require that the sources be separated and areunaffected by residual crosstalk, making them well suited for closelyspaced talkers. However, they do rely on accurate VAD: false negativeswould cause them to blend the cues of multiple active talkers, whilefalse positives would cause them to amplify a nearby noise source inplace of the missing talker. The crosstalk cancellation stage thereforehelps to mitigate spatial-cue distortion with an unreliable VAD.

FIG. 9 is a flow chart of a method 900 of crosstalk cancellation as between two remote microphones associated with the first and second users of FIG. 7 according to at least one embodiment. The method 900 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (such as instructions running on a processor), firmware, or a combination thereof. In one embodiment, the processing device 750 of the listening system 700 performs the method 900. Alternatively, other components of a computing device or cloud server may perform some or all of the operations of the method 900.

Although shown in a particular sequence or order, unless otherwisespecified, the order of the processes can be modified. Thus, theillustrated embodiments should be understood only as examples, and theillustrated processes can be performed in a different order, and someprocesses can be performed in parallel. Additionally, one or moreprocesses can be omitted in various embodiments. Thus, not all processesare required in every embodiment. Other process flows are possible.

At operation 910, the processing logic receives a first electronicsignal from a first microphone device.

At operation 920, the processing logic receives a second electronicsignal from a second microphone device. For example, the secondelectronic signal includes a mixture that includes the first electronicsignal due to crosstalk between the first and second microphone devices.

At operation 930, the processing logic removes the first electronicsignal from the second electronic signal to generate a cleansed secondelectronic signal.

At operation 940, the processing logic processes the cleansed secondelectronic signal to integrate the cleansed second electronic signalinto an output signal to the ear playback device.
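For illustration only, operations 910-940 could be composed per frame roughly as follows, reusing the kinds of filters sketched earlier; the frame-based structure and fixed filters are assumptions of this sketch, not the claimed method itself.

```python
# Illustrative per-frame composition of operations 910-940.
import numpy as np
from scipy.signal import lfilter

def method_900_frame(x_r1, x_r2, u_21, w_left, w_right):
    """x_r1, x_r2:      (T,) frames from the first and second microphone devices.
       u_21:            (K,) crosstalk cancellation filter (first onto second).
       w_left, w_right: (K,) spatialization filters toward the ear playback device."""
    # Operations 910-920: the two electronic signals are received as x_r1, x_r2.
    # Operation 930: remove the first signal's contribution from the second.
    s_hat_2 = x_r2 - lfilter(u_21, [1.0], x_r1)
    # Operation 940: integrate the cleansed signal into the binaural output.
    y = np.stack([lfilter(w_left, [1.0], s_hat_2),
                  lfilter(w_right, [1.0], s_hat_2)])
    return s_hat_2, y
```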

With additional reference to FIG. 10 , and as an extension to the embodiment of FIG. 8 , the third microphone device 720C is now also in play, as User3 has joined the group conversation. In embodiments, the third microphone device 720C is co-located with the first and second microphone devices 720A and 720B. In these embodiments, the third microphone device 720C generates a third electronic signal corresponding to sound detected within the audio detection range and is communicatively coupled to the processing device. In embodiments, the control logic 757 is further to detect a second crosstalk audio signal from a direction of the third microphone device 720C that matches the third electronic signal, for example.

Thus, in this embodiment, the first electronic signal x_(r,1) includes a mixture that includes the first crosstalk audio signal (from the direction of the second microphone device 720B) and the second crosstalk audio signal (from the direction of the third microphone device 720C). In embodiments, the first electronic signal also includes some ambient noise. In at least one embodiment, the processing device 750 receives and removes, e.g., using crosstalk cancellation filters 1002A, both the second electronic signal x_(r,2) and the third electronic signal x_(r,3) from the first electronic signal x_(r,1) to generate the cleansed first electronic signal ŝ₁. In embodiments, to remove the second electronic signal, the processing device 750 applies an adaptive cancellation filter to the first electronic signal with respect to the second electronic signal, and to remove the second crosstalk signal, the processing device 750 applies the adaptive cancellation filter to the first electronic signal with respect to the third electronic signal. (As discussed previously, however, in response to detecting that User1 starts talking, the processing device 750 disables the adaptive cancellation filter in some embodiments.) In at least some embodiments, this processing includes applying spatialization filters 1004A (e.g., at least a first audio filter of a set of audio filters) to the cleansed first electronic signal ŝ₁ with a first error signal, which is based on an output x_(e2,R) of the first ear microphone 112 of the first ear listening device 110, to generate a first output signal to the first ear playback device 116, for example.

In at least some embodiments, because of the third user, additional crosstalk may be captured within the third electronic signal (x_(r,3)), which substantially includes a mixture that includes the first electronic signal x_(r,1) and the second electronic signal x_(r,2). In these embodiments, the processing device 750 receives and removes, e.g., using crosstalk cancellation filters 1002B, both the first electronic signal x_(r,1) and the second electronic signal x_(r,2) from the third electronic signal x_(r,3) to generate a cleansed third electronic signal ŝ₃. In embodiments, to remove the first electronic signal, the processing device 750 applies an adaptive cancellation filter to the third electronic signal with respect to the first electronic signal, and to remove the second electronic signal, the processing device 750 applies the adaptive cancellation filter to the third electronic signal with respect to the second electronic signal. (As discussed previously, however, in response to detecting that User3 starts talking, the processing device 750 disables the adaptive cancellation filter in some embodiments.) In at least some embodiments, this processing includes applying spatialization filters 1004B (e.g., at least a second audio filter of a set of audio filters) to the cleansed third electronic signal ŝ₃ with a second error signal, which is based on an output x_(e2,L) of the second ear microphone 122 of the second ear listening device 120, to generate a second output signal to the second ear playback device 126, for example.

EXPERIMENTS: The listening system 700 was evaluated with three livehuman subjects seated around a table in an acoustically treatedlaboratory (T₆₀≈150 ms). Each subject wore an omnidirectional lavaliermicrophone behind each ear to simulate behind-the-ear hearing aids.Another such microphone was affixed to the table in front of eachsubject to simulate a mobile phone. Each subject also wore lapel andheadset microphones, which were used only for VAD. Noise was produced bya set of six loudspeakers playing clips from the VCTK speech corpus ofthe Centre for Speech Technology Voice Cloning Toolkit.

To simulate a group conversation, the subjects took turns reading from ascript for 60 seconds. In one recording, the subjects looked straightahead and tried not to move. In another, they turned to look at eachother and gestured while speaking. A third recording with moderatemotion was used for VAD training. To quantify the input and output SNRof the system, the noise was recorded separately and added to the livespeech recordings. The noise was therefore recorded with a differentmotion pattern than the live speech. Likewise, double-talk andtriple-talk mixtures were simulated by combining separate recordings.The microphones were sampled synchronously at 48 kHz and processed at 16kHz. The results shown here are for the left-ear output of a listeningdevice of one user.

The SNR improvement of the proposed conversation enhancement system isillustrated in FIG. 11 . Because the listening system 700 does notperform beamforming or other noise reduction processing, the SNRimprovement depends strongly on the placement of the remote microphones.The smartphone-like tabletop microphones had higher input SNR and lowercrosstalk than the earpiece and lapel microphones, especially at highfrequencies. Using the MIBO spatialization filters of FIG. 3B withoutcrosstalk cancellation improves the high-frequency SNR at the left earby up to 10 dB. The crosstalk filter helps to further suppress noisewhen nearby users are not speaking, providing another 2-5 dB benefit toSNR. A conventional remote microphone system that mutes all but onemicrophone achieves the best average output SNR, but is too distractingto be practical. The plot of FIG. 11 shows performance using theheadset-based VAD for nonmoving subjects. The results for otherexperimental conditions were similar and so are not reported. VADaccuracy and user motion appear to have little effect on ambient noisereduction.

FIG. 12A is a graph illustrating own-speech crosstalk suppression performance at a left earpiece of a listener using a headset microphone adapted to perform voice activity detection (VAD) according to experimental embodiments. FIG. 12B is a graph illustrating own-speech crosstalk suppression performance at a left earpiece of a listener using a lapel microphone adapted to perform VAD according to experimental embodiments. Thus, FIGS. 12A-12B show the crosstalk reduction performance of the system for the listener's own speech. The curves show the crosstalk level relative to the direct acoustic path to the earpiece. Because the users are seated close together, the own-speech crosstalk in the baseline spatialization-only system is just 2-5 dB weaker than the direct path. The crosstalk cancellation filters were able to suppress own-speech echoes by up to 15 dB more than the baseline system, but their performance depends on talker motion and on VAD accuracy. The residual crosstalk levels for moving talkers (solid curves) are higher than those for stationary talkers (dashed curves), especially at high frequencies, for which source positions may suddenly change by multiple acoustic wavelengths. Echo suppression was also worse for the less-reliable lapel-based VAD (FIG. 12B) compared to the more-reliable headset-based VAD (FIG. 12A). The performance of the muting system depends entirely upon VAD performance. With the reliable VAD, the muting system removed virtually all echoes; with the unreliable VAD, the muting system performed little better than the cancellation system at most frequencies in the motion experiment.

One can evaluate spatialization performance by comparing the interauralcues of the system output to the cues of the noise-free speech signalsat the ears. FIG. 13A is a graph illustrating high-frequency interaurallevel differences of other talkers at ears of a listener where subjectstake turns speaking while moving to face each other according toexperimental embodiments. FIG. 13B is a graph illustratinghigh-frequency interaural level differences of other talkers at ears ofa listener simulated with double-talk and triple-talk with subjectsfacing forward according to experimental embodiments. Thus, FIGS.13A-13B illustrate the input and output interaural level differences(ILD) of speech from the two other talkers at the ears of the listener.The ILDs are averaged over 1.5 sec windows from 1-8 kHz and color-codedto show the active talker(s). When the talkers take turns (FIG. 13A),only one spatialization filter adapts at a time. The output cues closelymatch the input cues, even as the listener turns their head. When bothother talkers speak simultaneously (FIG. 13B, 30-42 s), two filtersadapt jointly, preserving the spatial cues of both sources despiteresidual crosstalk. When the listener and talker(s) speak simultaneously(FIG. 13B, 42-60 s), crosstalk cancellation is disabled and thespatialization filters are unable to distinguish the listener's ownspeech from that of the other talkers, so their spatial cues areblended. Thus, a user might have trouble localizing conversationpartners while interrupting them.
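The ILD measurement described above (levels averaged over 1.5 second windows from 1-8 kHz) could be computed along these lines; the band-pass filter design and the assumption of a sampling rate well above 16 kHz are illustrative.

```python
# Sketch of the ILD measurement: interaural level differences averaged over
# 1.5 second windows in the 1-8 kHz band.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def apparent_ild_db(left, right, fs, win_s=1.5, f_lo=1000.0, f_hi=8000.0):
    sos = butter(4, [f_lo, f_hi], btype='bandpass', fs=fs, output='sos')
    l = sosfiltfilt(sos, left)
    r = sosfiltfilt(sos, right)
    win = int(win_s * fs)
    n = len(l) // win
    l2 = l[:n * win].reshape(n, win)
    r2 = r[:n * win].reshape(n, win)
    eps = 1e-12
    # Positive values mean the source is louder at the left ear.
    return (10.0 * np.log10((l2 ** 2).mean(axis=1) + eps)
            - 10.0 * np.log10((r2 ** 2).mean(axis=1) + eps))
```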

FIG. 14 is a block diagram of an example computer system 1400 in whichembodiments of the present disclosure can operate. The system 1400 mayrepresent the mobile device 140 or another device or system to which isreferred or which is capable of executing the embodiment as disclosedherein. The computer system 1400 may include an ordered listing of a setof instructions 1402 that may be executed to cause the computer system1400 to perform any one or more of the methods or computer-basedfunctions disclosed herein. The computer system 1400 may operate as astand-alone device or may be connected to other computer systems orperipheral devices, e.g., by using a network 1410.

In a networked deployment, the computer system 1400 may operate in thecapacity of a server or as a client-user computer in a server-clientuser network environment, or as a peer computer system in a peer-to-peer(or distributed) network environment. The computer system 1400 may alsobe implemented as or incorporated into various devices, such as apersonal computer or a mobile computing device capable of executing aset of instructions 1402 that specify actions to be taken by thatmachine, including and not limited to, accessing the internet or webthrough any form of browser. Further, each of the systems described mayinclude any collection of sub-systems that individually or jointlyexecute a set, or multiple sets, of instructions to perform one or morecomputer functions.

The computer system 1400 may include a memory 1404 on a bus 1420 forcommunicating information. Code operable to cause the computer system toperform any of the acts or operations described herein may be stored inthe memory 1404. The memory 1404 may be a random-access memory,read-only memory, programmable memory, hard disk drive, solid-state diskdrive, or other type of volatile or non-volatile memory or storagedevice.

The computer system 1400 may include a processor 1408, such as a centralprocessing unit (CPU) and/or a graphics processing unit (GPU) and mayinclude additional logic such as the audio detector 755 discussed withreference to FIG. 7 . The processor 1408 may include one or more generalprocessors, digital signal processors, application specific integratedcircuits, field programmable gate arrays, digital circuits, opticalcircuits, analog circuits, combinations thereof, or other now known orlater-developed devices for analyzing and processing data. The processor1408 may implement the set of instructions 1402 or other softwareprogram, such as manually-programmed or computer-generated code forimplementing logical functions. The logical function or system elementdescribed may, among other functions, process and/or convert an analogdata source such as an analog electrical, audio, or video signal, or acombination thereof, to a digital data source for audio-visual purposesor other digital processing purposes such as for compatibility forcomputer processing.

The computer system 1400 may also include a disk (or optical) drive unit1415. The disk drive unit 1415 may include a non-transitorycomputer-readable storage medium 1440 in which one or more sets ofinstructions 1402, e.g., software, can be embedded or stored. Further,the instructions 1402 may perform one or more of the operations asdescribed herein. The instructions 1402 may reside completely, or atleast partially, within the memory 1404 and/or within the processor 1408during execution by the computer system 1400.

The memory 1404 and the processor 1408 also may include non-transitorycomputer-readable media as discussed above. A “computer-readablemedium,” “computer-readable storage medium,” “machine readable medium,”“propagated-signal medium,” and/or “signal-bearing medium” may includeany device that includes, stores, communicates, propagates, ortransports software for use by or in connection with an instructionexecutable system, apparatus, or device. The machine-readable medium mayselectively be, but not limited to, an electronic, magnetic, optical,electromagnetic, infrared, or semiconductor system, apparatus, device,or propagation medium.

Additionally, the computer system 1400 may include an input device 1425,such as a keyboard or mouse, configured for a user to interact with anyof the components of system 1400. It may further include a display 1430,such as a liquid crystal display (LCD), a cathode ray tube (CRT), or anyother display suitable for conveying information. The display 1430 mayact as an interface for the user to see the functioning of the processor1408, or specifically as an interface with the software stored in thememory 1404 or the drive unit 1415.

The computer system 1400 may include a communication interface 1436 that enables communications via the communications network 1410. The communications network 1410 may include wired networks, wireless networks, or combinations thereof. The communication interface 1436 may enable communications via a number of communication standards, such as 802.11, 802.17, 802.20, WiMax, cellular telephone standards, or other communication standards.

Accordingly, the method and system may be realized in hardware,software, or a combination of hardware and software. The method andsystem may be realized in a centralized fashion in at least one computersystem or in a distributed fashion where different elements are spreadacross several interconnected computer systems. A computer system orother apparatus adapted for carrying out the methods described herein issuited to the present disclosure. A typical combination of hardware andsoftware may be a general-purpose computer system with a computerprogram that, when being loaded and executed, controls the computersystem such that it carries out the methods described herein. Such aprogrammed computer may be considered a special-purpose computer.

The method and system may also be embedded in a computer programproduct, which includes all the features enabling the implementation ofthe operations described herein and which, when loaded in a computersystem, is able to carry out these operations. Computer program in thepresent context means any expression, in any language, code or notation,of a set of instructions intended to cause a system having aninformation processing capability to perform a particular function,either directly or after either or both of the following: a) conversionto another language, code or notation; b) reproduction in a differentmaterial form.

The disclosure also relates to an apparatus for performing theoperations herein. This apparatus can be specially constructed for theintended purposes, or it can include a general purpose computerselectively activated or reconfigured by a computer program stored inthe computer. Such a computer program can be stored in a computerreadable storage medium, such as, but not limited to, any type of diskincluding floppy disks, optical disks, CD-ROMs, and magnetic-opticaldisks, read-only memories (ROMs), random access memories (RAMs), EPROMs,EEPROMs, magnetic or optical cards, SD-cards, solid-state drives, or anytype of media suitable for storing electronic instructions, each coupledto a computer system bus.

The algorithms, operations, and displays presented herein are notinherently related to any particular computer or other apparatus.Various general purpose systems can be used with programs in accordancewith the teachings herein, or it can prove convenient to construct amore specialized apparatus to perform the method. The structure for avariety of these systems will appear as set forth in the descriptionbelow. In addition, the disclosure is not described with reference toany particular programming language. It will be appreciated that avariety of programming languages can be used to implement the teachingsof the disclosure as described herein.

The disclosure can be provided as a computer program product, orsoftware, that can include a machine-readable medium having storedthereon instructions, which can be used to program a computer system (orother electronic devices) to perform a process according to thedisclosure. A machine-readable medium includes any mechanism for storinginformation in a form readable by a machine (e.g., a computer). In someembodiments, a machine-readable (e.g., computer-readable) mediumincludes a machine (e.g., a computer) readable storage medium such as aread only memory (“ROM”), random access memory (“RAM”), magnetic diskstorage media, optical storage media, flash memory components,solid-state memory components, etc.

The words “example” or “exemplary” are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words “example” or “exemplary” is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or.” That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. That is, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims may generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Moreover, use of the term “an implementation” or “one implementation” or “an embodiment” or “one embodiment” or the like throughout is not intended to mean the same implementation or embodiment unless described as such. One or more implementations or embodiments described herein may be combined in a particular implementation or embodiment. The terms “first,” “second,” “third,” “fourth,” etc. as used herein are meant as labels to distinguish among different elements and may not necessarily have an ordinal meaning according to their numerical designation.

In the foregoing specification, embodiments of the disclosure have beendescribed with reference to specific example embodiments thereof. Itwill be evident that various modifications can be made thereto withoutdeparting from the broader spirit and scope of embodiments of thedisclosure as set forth in the following claims. The specification anddrawings are, accordingly, to be regarded in an illustrative senserather than a restrictive sense.

What is claimed is:
 1. A listening system comprising: a first microphonedevice and a second microphone device that are co-located in an area andto generate a first electronic signal and a second electronic signal,respectively, corresponding to sound within audio detection range;control logic associated with the first microphone device, the controllogic to detect a crosstalk audio signal from a direction of the secondmicrophone device that matches the second electronic signal, wherein thefirst electronic signal comprises a mixture that includes the crosstalkaudio signal; an ear playback device associated with the secondmicrophone device; and a processing device communicatively coupled tothe first and second microphone devices, to the control logic, and tothe ear playback device, the processing device to: receive the firstelectronic signal and the second electronic signal; remove the secondelectronic signal from the first electronic signal to generate acleansed first electronic signal; and process the cleansed firstelectronic signal to integrate the cleansed first electronic signal intoan output signal to the ear playback device.
 2. The listening system of claim 1, wherein the second microphone device is one of an in-ear microphone integrated within the ear playback device or a microphone integrated within a mobile device of a user of the ear playback device.
 3. The listening system of claim 1, wherein the first microphone device is an audio detection system that includes at least a portion of the control logic and the processing device.
4. The listening system of claim 1, wherein, to remove the second electronic signal, the processing device is to apply an adaptive cancellation filter to the first electronic signal with respect to the second electronic signal.
5. The listening system of claim 4, wherein, in response to the control logic identifying a first audio signal indicative of speech from a first user, the processing device is further to disable the adaptive cancellation filter.
6. The listening system of claim 4, wherein the processing device is further to continuously update the adaptive cancellation filter to perform crosstalk cancellation optimization.
7. The listening system of claim 1, wherein the first and second microphone devices are instantiated within a single audio detection device that uses beamforming to detect a first audio signal from the first microphone device and the crosstalk audio signal, wherein the audio detection device further comprises at least a portion of the control logic.
8. The listening system of claim 1, further comprising a third microphone device that is co-located with the first and second microphone devices, the third microphone device to generate a third electronic signal corresponding to sound detected within the audio detection range and communicatively coupled to the processing device, wherein: the control logic is further to detect a second crosstalk audio signal from a direction of the third microphone device that matches the third electronic signal, wherein the first electronic signal includes a mixture that includes the crosstalk audio signal and the second crosstalk audio signal; and the processing device is further to receive and remove the third electronic signal from the first electronic signal to generate the cleansed first electronic signal.
9. The listening system of claim 8, wherein, to remove the second crosstalk audio signal, the processing device is to apply an adaptive cancellation filter to the first electronic signal with respect to the third electronic signal.
10. The listening system of claim 1, wherein, to process the cleansed first electronic signal, the processing device is to apply a set of audio filters comprising a first audio filter to process the cleansed first electronic signal with a first error signal, which is based on an output of a first ear microphone of the ear playback device, to generate the output signal.
11. An electronic assembly comprising: an ear playback device; a first microphone device associated with the ear playback device; and a processing device communicatively coupled to the first microphone device, to the ear playback device, and to a second microphone device that is co-located in an area with the first microphone device, the processing device to: receive a first electronic signal from the first microphone device; receive a second electronic signal from the second microphone device, wherein the second electronic signal comprises a mixture that includes the first electronic signal due to crosstalk between the first and second microphone devices; remove the first electronic signal from the second electronic signal to generate a cleansed second electronic signal; and process the cleansed second electronic signal to integrate the cleansed second electronic signal into an output signal to the ear playback device.
12. The electronic assembly of claim 11, wherein the first microphone device is integrated within the ear playback device and the processing device is integrated within a mobile device that is paired with the ear playback device.
13. The electronic assembly of claim 11, wherein the first microphone device is integrated within a mobile device that includes the processing device and the ear playback device is paired with the mobile device.
14. The electronic assembly of claim 11, further comprising a second ear playback device in which is integrated the second microphone device.
15. The electronic assembly of claim 11, wherein, to remove the first electronic signal, the processing device is to apply an adaptive cancellation filter to the second electronic signal with respect to the first electronic signal.
16. The electronic assembly of claim 15, wherein the processing device is further to continuously update the adaptive cancellation filter to perform crosstalk cancellation optimization.
17. The electronic assembly of claim 11, further comprising a third microphone device that is co-located with the first and second microphone devices and communicatively coupled to the processing device, wherein the processing device is further to: receive a third electronic signal from the third microphone device, wherein the second electronic signal further includes the third electronic signal due to crosstalk between the first and third microphone devices; and remove the third electronic signal from the second electronic signal to generate the cleansed second electronic signal.
18. The electronic assembly of claim 17, wherein, to remove the third electronic signal, the processing device is to apply an adaptive cancellation filter to the second electronic signal with respect to the third electronic signal.
19. The electronic assembly of claim 17, wherein, to process the cleansed second electronic signal, the processing device is to apply a set of audio filters comprising a first audio filter to process the cleansed second electronic signal with a first error signal, which is based on an output of a first ear microphone of the ear playback device, to generate the output signal.
20. A non-transitory computer-readable storage medium storing instructions, which when executed by a processing device that is communicatively coupled to an ear playback device, a first microphone device, and a second microphone device co-located in an area with the first microphone device, cause the processing device to perform operations comprising: receiving a first electronic signal from the first microphone device; receiving a second electronic signal from the second microphone device, wherein the second electronic signal comprises a mixture that includes the first electronic signal due to crosstalk between the first and second microphone devices; removing the first electronic signal from the second electronic signal to generate a cleansed second electronic signal; and processing the cleansed second electronic signal to integrate the cleansed second electronic signal into an output signal to the ear playback device.
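
By way of illustration only, and without limiting the claimed subject matter, the adaptive cancellation filter recited in claims 4 through 6 and 15 through 16 could be realized with, for example, a normalized least-mean-squares (NLMS) update: the reference signal (e.g., the second electronic signal) is filtered by an adaptive estimate of the crosstalk path and subtracted from the primary signal (e.g., the first electronic signal), and the residual drives the filter update. The sketch below is a minimal, hypothetical example of such a canceller; the function name, parameter choices (number of taps, step size), and use of NumPy are assumptions made for illustration and are not part of the claims.

import numpy as np

def nlms_crosstalk_cancel(primary, reference, num_taps=64, mu=0.1, eps=1e-8):
    """Return `primary` with an adaptively estimated copy of `reference` removed."""
    w = np.zeros(num_taps)              # adaptive estimate of the crosstalk path
    buf = np.zeros(num_taps)            # most recent reference samples
    cleansed = np.zeros(len(primary))   # cleansed primary signal (error signal)
    for n in range(len(primary)):
        buf = np.roll(buf, 1)           # shift in the newest reference sample
        buf[0] = reference[n]
        estimate = np.dot(w, buf)       # predicted crosstalk component
        cleansed[n] = primary[n] - estimate
        # Normalized LMS update; this step could be skipped (filter frozen)
        # while the local talker is active, as contemplated by claim 5.
        w += (mu / (eps + np.dot(buf, buf))) * cleansed[n] * buf
    return cleansed

In such a sketch, continuously applying the update on every sample corresponds to the continuous optimization described in claims 6 and 16, while gating the update on a voice-activity decision corresponds to disabling the filter per claim 5.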